* Filesystem benchmarks on reasonably fast hardware
@ 2011-07-17 16:05 Jörn Engel
2011-07-17 23:32 ` Dave Chinner
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Jörn Engel @ 2011-07-17 16:05 UTC (permalink / raw)
To: linux-fsdevel
Hello everyone!
Recently I have had the pleasure of working with some nice hardware
and the displeasure of seeing it fail commercially. However, when
trying to optimize performance I noticed that in some cases the
bottlenecks were not in the hardware or my driver, but rather in the
filesystem on top of it. So maybe all this may still be useful in
improving said filesystem.
Hardware is basically a fast SSD. Performance tops out at about
650MB/s and is fairly insensitive to random access behaviour. Latency
is about 50us for 512B reads and near 0 for writes, through the usual
cheating.
Numbers below were created with sysbench, using directIO. Each block
is a matrix with results for blocksizes from 512B to 16384B and thread
count from 1 to 128. Four blocks for reads and writes, both
sequential and random.
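The run matrix can be sketched roughly as follows. This is a hypothetical reconstruction (the actual script was only posted as an attachment later in the thread); the flag names follow sysbench 0.4.x fileio, and the sketch only prints the commands rather than running them:

```python
# Hypothetical reconstruction of the benchmark matrix; the real script
# was attached later in the thread and may differ in detail.
modes = ["seqrd", "rndrd", "seqwr", "rndwr"]
block_sizes = [16384, 8192, 4096, 2048, 1024, 512]
threads = [1, 2, 4, 8, 16, 32, 64, 128]

def command(mode, bs, nthr):
    # sysbench 0.4.x fileio; --file-extra-flags=direct requests O_DIRECT
    return (f"sysbench --test=fileio --file-test-mode={mode}"
            f" --file-block-size={bs} --num-threads={nthr}"
            f" --file-extra-flags=direct run")

for mode in modes:
    for bs in block_sizes:
        for nthr in threads:
            print(command(mode, bs, nthr))  # print only; pipe to sh to execute
```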
Ext4:
=====
seqrd 1 2 4 8 16 32 64 128
16384 4867 8717 16367 29249 39131 39140 39135 39123
8192 6324 10889 19980 37239 66346 78444 78429 78409
4096 9158 15810 26072 45999 85371 148061 157222 157294
2048 15019 24555 35934 59698 106541 198986 313969 315566
1024 24271 36914 51845 80230 136313 252832 454153 484120
512 37803 62144 78952 111489 177844 314896 559295 615744
rndrd 1 2 4 8 16 32 64 128
16384 4770 8539 14715 23465 33630 39073 39101 39103
8192 6138 11398 20873 35785 56068 75743 78364 78374
4096 8338 15657 29648 53927 91854 136595 157279 157349
2048 11985 22894 43495 81211 148029 239962 314183 315695
1024 16529 31306 61307 114721 222700 387439 561810 632719
512 20580 40160 73642 135711 294583 542828 795607 821025
seqwr 1 2 4 8 16 32 64 128
16384 37588 37600 37730 37680 37631 37664 37670 37662
8192 77621 77737 77947 77967 77875 77939 77833 77574
4096 124083 123171 121159 120947 120202 120315 119917 120236
2048 158540 153993 151128 150663 150686 151159 150358 147827
1024 183300 176091 170527 170919 169608 169900 169662 168622
512 229167 231672 221629 220416 223490 217877 222390 219718
rndwr 1 2 4 8 16 32 64 128
16384 38932 38290 38200 38306 38421 38404 38329 38326
8192 79790 77297 77464 77447 77420 77460 77495 77545
4096 163985 157626 158232 158212 158102 158169 158273 158236
2048 272261 322637 320032 320932 321597 322008 322242 322699
1024 339647 609192 652655 644903 654604 658292 659119 659667
512 403366 718426 1227643 1149850 1155541 1157633 1173567 1180710
Sequential writes are significantly worse than random writes. If
someone is interested, I can see which lock is causing all this.
Sequential reads below 2k are also worse, although one might wonder
whether direct IO on 1k chunks makes sense at all. Random reads in
the last column scale very nicely with block size down to 1k, but hit
some problem at 512B. The machine could be cpu-bound at this point.
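As a back-of-the-envelope check (assuming the table cells are IOPS and using the 650MB/s ceiling from the hardware description above), the 16k rows are pinned at the device limit while the 512B random-read column is well below it, which is consistent with a CPU rather than device bottleneck:

```python
# Sanity check: convert IOPS cells to MB/s (decimal megabytes).
DEVICE_LIMIT_MBS = 650  # from the hardware description above

def throughput_mbs(iops, block_size):
    return iops * block_size / 1e6

# ext4 rndrd, 16384B @ 128 threads: saturates the device
print(round(throughput_mbs(39103, 16384)))  # 641 MB/s, ~device limit
# ext4 rndrd, 512B @ 128 threads: far below the device limit
print(round(throughput_mbs(821025, 512)))   # 420 MB/s, bottleneck elsewhere
```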
Btrfs:
======
seqrd 1 2 4 8 16 32 64 128
16384 3270 6582 12919 24866 36424 39682 39726 39721
8192 4394 8348 16483 32165 54221 79256 79396 79415
4096 6337 12024 21696 40569 74924 131763 158292 158763
2048 297222 298299 294727 294740 296496 298517 300118 300740
1024 583891 595083 584272 580965 584030 589115 599634 598054
512 1103026 1175523 1134172 1133606 1123684 1123978 1156758 1130354
rndrd 1 2 4 8 16 32 64 128
16384 3252 6621 12437 20354 30896 39365 39115 39746
8192 4273 8749 17871 32135 51812 72715 79443 79456
4096 5842 11900 24824 48072 84485 128721 158631 158812
2048 7177 12540 20244 27543 32386 34839 35728 35916
1024 7178 12577 20341 27473 32656 34763 36056 35960
512 7176 12554 20289 27603 32504 34781 35983 35919
seqwr 1 2 4 8 16 32 64 128
16384 13357 12838 12604 12596 12588 12641 12716 12814
8192 21426 20471 20090 20097 20287 20236 20445 20528
4096 30740 29187 28528 28525 28576 28580 28883 29258
2048 2949 3214 3360 3431 3440 3498 3396 3498
1024 2167 2205 2412 2376 2473 2221 2410 2420
512 1888 1876 1926 1981 1935 1938 1957 1976
rndwr 1 2 4 8 16 32 64 128
16384 10985 19312 27430 27813 28157 28528 28308 28234
8192 16505 29420 35329 34925 36020 34976 35897 35174
4096 21894 31724 34106 34799 36119 36608 37571 36274
2048 3637 8031 15225 22599 30882 31966 32567 32427
1024 3704 8121 15219 23670 31784 33156 31469 33547
512 3604 7988 15206 23742 32007 31933 32523 33667
Sequential writes below 4k perform drastically worse. Quite
unexpected. Write performance across the board is horrible when
compared to ext4. Sequential reads are much better, in particular for
<4k cases. I would assume some sort of readahead is happening.
Random reads <4k again drop off significantly.
xfs:
====
seqrd 1 2 4 8 16 32 64 128
16384 4698 4424 4397 4402 4394 4398 4642 4679
8192 6234 5827 5797 5801 5795 6114 5793 5812
4096 9100 8835 8882 8896 8874 8890 8910 8906
2048 14922 14391 14259 14248 14264 14264 14269 14273
1024 23853 22690 22329 22362 22338 22277 22240 22301
512 37353 33990 33292 33332 33306 33296 33224 33271
rndrd 1 2 4 8 16 32 64 128
16384 4585 8248 14219 22533 32020 38636 39033 39054
8192 6032 11186 20294 34443 53112 71228 78197 78284
4096 8247 15539 29046 52090 86744 125835 154031 157143
2048 11950 22652 42719 79562 140133 218092 286111 314870
1024 16526 31294 59761 112494 207848 348226 483972 574403
512 20635 39755 73010 130992 270648 484406 686190 726615
seqwr 1 2 4 8 16 32 64 128
16384 39956 39695 39971 39913 37042 37538 36591 32179
8192 67934 66073 30963 29038 29852 25210 23983 28272
4096 89250 81417 28671 18685 12917 14870 22643 22237
2048 140272 120588 140665 140012 137516 139183 131330 129684
1024 217473 147899 210350 218526 219867 220120 219758 215166
512 328260 181197 211131 263533 294009 298203 301698 298013
rndwr 1 2 4 8 16 32 64 128
16384 38447 38153 38145 38140 38156 38199 38208 38236
8192 78001 76965 76908 76945 77023 77174 77166 77106
4096 160721 156000 157196 157084 157078 157123 156978 157149
2048 325395 317148 317858 318442 318750 318981 319798 320393
1024 434084 649814 650176 651820 653928 654223 655650 655818
512 501067 876555 1290292 1217671 1244399 1267729 1285469 1298522
Sequential reads are pretty horrible. Sequential writes are hitting a
hot lock again.
So, if anyone would like to improve one of these filesystems and needs
more data, feel free to ping me.
Jörn
--
Victory in war is not repetitious.
-- Sun Tzu
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-17 16:05 Filesystem benchmarks on reasonably fast hardware Jörn Engel
@ 2011-07-17 23:32 ` Dave Chinner
[not found] ` <20110718075339.GB1437@logfs.org>
2011-07-18 12:07 ` Ted Ts'o
2011-07-19 13:19 ` Dave Chinner
2 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2011-07-17 23:32 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-fsdevel
On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> Hello everyone!
>
> Recently I have had the pleasure of working with some nice hardware
> and the displeasure of seeing it fail commercially. However, when
> trying to optimize performance I noticed that in some cases the
> bottlenecks were not in the hardware or my driver, but rather in the
> filesystem on top of it. So maybe all this may still be useful in
> improving said filesystem.
>
> Hardware is basically a fast SSD. Performance tops out at about
> 650MB/s and is fairly insensitive to random access behaviour. Latency
> is about 50us for 512B reads and near 0 for writes, through the usual
> cheating.
>
> Numbers below were created with sysbench, using directIO. Each block
> is a matrix with results for blocksizes from 512B to 16384B and thread
> count from 1 to 128. Four blocks for reads and writes, both
> sequential and random.
What's the command line/script used to generate the result matrix?
And what kernel are you running on?
> xfs:
> ====
> seqrd 1 2 4 8 16 32 64 128
> 16384 4698 4424 4397 4402 4394 4398 4642 4679
> 8192 6234 5827 5797 5801 5795 6114 5793 5812
> 4096 9100 8835 8882 8896 8874 8890 8910 8906
> 2048 14922 14391 14259 14248 14264 14264 14269 14273
> 1024 23853 22690 22329 22362 22338 22277 22240 22301
> 512 37353 33990 33292 33332 33306 33296 33224 33271
Something is single threading completely there - something is very
wrong. Someone want to send me a nice fast pci-e SSD - my disks
don't spin that fast... :/
> rndrd 1 2 4 8 16 32 64 128
> 16384 4585 8248 14219 22533 32020 38636 39033 39054
> 8192 6032 11186 20294 34443 53112 71228 78197 78284
> 4096 8247 15539 29046 52090 86744 125835 154031 157143
> 2048 11950 22652 42719 79562 140133 218092 286111 314870
> 1024 16526 31294 59761 112494 207848 348226 483972 574403
> 512 20635 39755 73010 130992 270648 484406 686190 726615
>
> seqwr 1 2 4 8 16 32 64 128
> 16384 39956 39695 39971 39913 37042 37538 36591 32179
> 8192 67934 66073 30963 29038 29852 25210 23983 28272
> 4096 89250 81417 28671 18685 12917 14870 22643 22237
> 2048 140272 120588 140665 140012 137516 139183 131330 129684
> 1024 217473 147899 210350 218526 219867 220120 219758 215166
> 512 328260 181197 211131 263533 294009 298203 301698 298013
>
> rndwr 1 2 4 8 16 32 64 128
> 16384 38447 38153 38145 38140 38156 38199 38208 38236
> 8192 78001 76965 76908 76945 77023 77174 77166 77106
> 4096 160721 156000 157196 157084 157078 157123 156978 157149
> 2048 325395 317148 317858 318442 318750 318981 319798 320393
> 1024 434084 649814 650176 651820 653928 654223 655650 655818
> 512 501067 876555 1290292 1217671 1244399 1267729 1285469 1298522
I'm assuming that if the h/w can do 650MB/s then the numbers are in
IOPS? From 4 threads up all results equate to 650MB/s.
> Sequential reads are pretty horrible. Sequential writes are hitting a
> hot lock again.
lockstat output?
> So, if anyone would like to improve one of these filesystems and needs
> more data, feel free to ping me.
Of course I'm interested. ;)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Filesystem benchmarks on reasonably fast hardware
[not found] ` <20110718075339.GB1437@logfs.org>
@ 2011-07-18 10:57 ` Dave Chinner
2011-07-18 11:40 ` Jörn Engel
2011-07-18 14:34 ` Jörn Engel
[not found] ` <20110718103956.GE1437@logfs.org>
1 sibling, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2011-07-18 10:57 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-fsdevel
On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> > >
> > > Numbers below were created with sysbench, using directIO. Each block
> > > is a matrix with results for blocksizes from 512B to 16384B and thread
> > > count from 1 to 128. Four blocks for reads and writes, both
> > > sequential and random.
> >
> > What's the command line/script used to generate the result matrix?
> > And what kernel are you running on?
>
> Script is attached. Kernel is git from July 13th (51414d41).
Ok, thanks.
> > > xfs:
> > > ====
> > > seqrd 1 2 4 8 16 32 64 128
> > > 16384 4698 4424 4397 4402 4394 4398 4642 4679
> > > 8192 6234 5827 5797 5801 5795 6114 5793 5812
> > > 4096 9100 8835 8882 8896 8874 8890 8910 8906
> > > 2048 14922 14391 14259 14248 14264 14264 14269 14273
> > > 1024 23853 22690 22329 22362 22338 22277 22240 22301
> > > 512 37353 33990 33292 33332 33306 33296 33224 33271
> >
> > Something is single threading completely there - something is very
> > wrong. Someone want to send me a nice fast pci-e SSD - my disks
> > don't spin that fast... :/
>
> I wish I could just go down the shop and pick one from the
> manufacturing line. :/
Heh. At this point any old pci-e ssd would be an improvement ;)
> > > rndwr 1 2 4 8 16 32 64 128
> > > 16384 38447 38153 38145 38140 38156 38199 38208 38236
> > > 8192 78001 76965 76908 76945 77023 77174 77166 77106
> > > 4096 160721 156000 157196 157084 157078 157123 156978 157149
> > > 2048 325395 317148 317858 318442 318750 318981 319798 320393
> > > 1024 434084 649814 650176 651820 653928 654223 655650 655818
> > > 512 501067 876555 1290292 1217671 1244399 1267729 1285469 1298522
> >
> > I'm assuming that if the h/w can do 650MB/s then the numbers are in
> > IOPS? From 4 threads up all results equate to 650MB/s.
>
> Correct. Writes are spread automatically across all chips. They are
> further cached, so until every chip is busy writing, their effective
> latency is pretty much 0. Makes for a pretty flat graph, I agree.
>
> > > Sequential reads are pretty horrible. Sequential writes are hitting a
> > > hot lock again.
> >
> > lockstat output?
>
> Attached for the bottom right case each of seqrd and seqwr. I hope
> the filenames are descriptive enough.
Looks like you attached the seqrd lockstat twice.
> Lockstat itself hurts
> performance. Writes were at 32245 IO/s from 298013, reads at 22458
> IO/s from 33271. In a way we are measuring oranges to figure out why
> our apples are so small.
Yeah, but at least it points out the lock in question - the iolock.
We grab it exclusively for a very short period of time on each
direct IO read to check the page cache state, then demote it to
shared. I can see that when IO times are very short, this will, in
fact, serialise multiple readers to a single file.
A single thread shows this locking pattern:
sysbench-3087 [000] 2192558.643146: xfs_ilock: dev 253:0 ino 0x83 flags IOLOCK_EXCL caller xfs_rw_ilock
sysbench-3087 [000] 2192558.643147: xfs_ilock_demote: dev 253:0 ino 0x83 flags IOLOCK_EXCL caller T.1428
sysbench-3087 [000] 2192558.643150: xfs_ilock: dev 253:0 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_map_shared
sysbench-3087 [001] 2192558.643877: xfs_ilock: dev 253:0 ino 0x83 flags IOLOCK_EXCL caller xfs_rw_ilock
sysbench-3087 [001] 2192558.643879: xfs_ilock_demote: dev 253:0 ino 0x83 flags IOLOCK_EXCL caller T.1428
sysbench-3087 [007] 2192558.643881: xfs_ilock: dev 253:0 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_map_shared
Two threads show this:
sysbench-3096 [005] 2192697.678308: xfs_ilock: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
sysbench-3096 [005] 2192697.678314: xfs_ilock_demote: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
sysbench-3096 [005] 2192697.678335: xfs_ilock: dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared
sysbench-3097 [006] 2192697.678556: xfs_ilock: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
sysbench-3097 [006] 2192697.678556: xfs_ilock_demote: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
sysbench-3097 [006] 2192697.678577: xfs_ilock: dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared
sysbench-3096 [007] 2192697.678976: xfs_ilock: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
sysbench-3096 [007] 2192697.678978: xfs_ilock_demote: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
sysbench-3096 [007] 2192697.679000: xfs_ilock: dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared
Which shows the exclusive lock on the concurrent IO serialising on
the IO in progress. Oops, that's not good.
Ok, the patch below takes the numbers on my test setup on a 16k IO
size:
seqrd 1 2 4 8 16
vanilla 3603 2798 2563 not tested...
patches 3707 5746 10304 12875 11016
So those numbers look a lot healthier. The patch is below.
> --
> Fancy algorithms are slow when n is small, and n is usually small.
> Fancy algorithms have big constants. Until you know that n is
> frequently going to be big, don't get fancy.
> -- Rob Pike
Heh. XFS always assumes n will be big. Because where XFS is used, it
just is.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
xfs: don't serialise direct IO reads on page cache checks
From: Dave Chinner <dchinner@redhat.com>
There is no need to grab the i_mutex of the IO lock in exclusive
mode if we don't need to invalidate the page cache. Taking hese
locks on every direct IO effective serialisaes them as taking the IO
lock in exclusive mode has to wait for all shared holders to drop
the lock. That only happens when IO is complete, so effective it
prevents dispatch of concurrent direct IO reads to the same inode.
Fix this by taking the IO lock shared to check the page cache state,
and only then drop it and take the IO lock exclusively if there is
work to be done. Hence for the normal direct IO case, no exclusive
locking will occur.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/linux-2.6/xfs_file.c | 17 ++++++++++++++---
1 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index 1e641e6..16a4bf0 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -321,7 +321,19 @@ xfs_file_aio_read(
if (XFS_FORCED_SHUTDOWN(mp))
return -EIO;
- if (unlikely(ioflags & IO_ISDIRECT)) {
+ /*
+ * Locking is a bit tricky here. If we take an exclusive lock
+ * for direct IO, we effectively serialise all new concurrent
+ * read IO to this file and block it behind IO that is currently in
+ * progress because IO in progress holds the IO lock shared. We only
+ * need to hold the lock exclusive to blow away the page cache, so
+ * only take lock exclusively if the page cache needs invalidation.
+ * This allows the normal direct IO case of no page cache pages to
+ * proceeed concurrently without serialisation.
+ */
+ xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+ if ((ioflags & IO_ISDIRECT) && inode->i_mapping->nrpages) {
+ xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
if (inode->i_mapping->nrpages) {
@@ -334,8 +346,7 @@ xfs_file_aio_read(
}
}
xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
- } else
- xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+ }
trace_xfs_file_read(ip, size, iocb->ki_pos, ioflags);
--
* Re: Filesystem benchmarks on reasonably fast hardware
[not found] ` <20110718103956.GE1437@logfs.org>
@ 2011-07-18 11:10 ` Dave Chinner
0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2011-07-18 11:10 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-fsdevel
On Mon, Jul 18, 2011 at 12:39:56PM +0200, Jörn Engel wrote:
> Write lockstat (I mistakenly sent the read one twice).
Yeah, that's the i_mutex that is the issue there. We are definitely
taking exclusive locks during the IO submission process there.
I suspect I might be able to write a patch that does all the checks
under a shared lock - similar to the patch for the read side - but
it is definitely more complex and I'll have to have a bit of a think
about it.
Thanks for the bug report!
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-18 10:57 ` Dave Chinner
@ 2011-07-18 11:40 ` Jörn Engel
2011-07-19 2:41 ` Dave Chinner
2011-07-18 14:34 ` Jörn Engel
1 sibling, 1 reply; 21+ messages in thread
From: Jörn Engel @ 2011-07-18 11:40 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel
On Mon, 18 July 2011 20:57:49 +1000, Dave Chinner wrote:
> On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> > On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
>
> > > > xfs:
> > > > ====
> > > > seqrd 1 2 4 8 16 32 64 128
> > > > 16384 4698 4424 4397 4402 4394 4398 4642 4679
> > > > 8192 6234 5827 5797 5801 5795 6114 5793 5812
> > > > 4096 9100 8835 8882 8896 8874 8890 8910 8906
> > > > 2048 14922 14391 14259 14248 14264 14264 14269 14273
> > > > 1024 23853 22690 22329 22362 22338 22277 22240 22301
> > > > 512 37353 33990 33292 33332 33306 33296 33224 33271
Your patch definitely helps. Bottom right number is 584741 now.
Still slower than ext4 or btrfs, but in the right ballpark. Will
post the entire block once it has been generated.
Jörn
--
Data dominates. If you've chosen the right data structures and organized
things well, the algorithms will almost always be self-evident. Data
structures, not algorithms, are central to programming.
-- Rob Pike
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-17 16:05 Filesystem benchmarks on reasonably fast hardware Jörn Engel
2011-07-17 23:32 ` Dave Chinner
@ 2011-07-18 12:07 ` Ted Ts'o
2011-07-18 12:42 ` Jörn Engel
2011-07-19 13:19 ` Dave Chinner
2 siblings, 1 reply; 21+ messages in thread
From: Ted Ts'o @ 2011-07-18 12:07 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-fsdevel
Hey Jörn,
Can you send me your script and the lockstat for ext4?
(Please cc the linux-ext4@vger.kernel.org list if you don't mind.
Thanks!!)
Thanks,
- Ted
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-18 12:07 ` Ted Ts'o
@ 2011-07-18 12:42 ` Jörn Engel
2011-07-25 15:18 ` Ted Ts'o
0 siblings, 1 reply; 21+ messages in thread
From: Jörn Engel @ 2011-07-18 12:42 UTC (permalink / raw)
To: Ted Ts'o; +Cc: linux-fsdevel, linux-ext4
[-- Attachment #1: Type: text/plain, Size: 666 bytes --]
On Mon, 18 July 2011 08:07:51 -0400, Ted Ts'o wrote:
>
> Can you send me your script and the lockstat for ext4?
Attached. The first script generates a bunch of files, the second
condenses them into the tabular form. Will need some massaging to
work on anything other than my particular setup, sorry.
> (Please cc the linux-ext4@vger.kernel.org list if you don't mind.
> Thanks!!)
Sure. Lockstat will come later today. The machine is currently busy
regenerating xfs seqrd numbers.
Jörn
--
I've never met a human being who would want to read 17,000 pages of
documentation, and if there was, I'd kill him to get him out of the
gene pool.
-- Joseph Costello
[-- Attachment #2: sysbench.sh --]
[-- Type: application/x-sh, Size: 1612 bytes --]
[-- Attachment #3: sysbench_result.sh --]
[-- Type: application/x-sh, Size: 819 bytes --]
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-18 10:57 ` Dave Chinner
2011-07-18 11:40 ` Jörn Engel
@ 2011-07-18 14:34 ` Jörn Engel
1 sibling, 0 replies; 21+ messages in thread
From: Jörn Engel @ 2011-07-18 14:34 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel
On Mon, 18 July 2011 20:57:49 +1000, Dave Chinner wrote:
> On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> > On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
>
> > > > xfs:
> > > > ====
> > > > seqrd 1 2 4 8 16 32 64 128
> > > > 16384 4698 4424 4397 4402 4394 4398 4642 4679
> > > > 8192 6234 5827 5797 5801 5795 6114 5793 5812
> > > > 4096 9100 8835 8882 8896 8874 8890 8910 8906
> > > > 2048 14922 14391 14259 14248 14264 14264 14269 14273
> > > > 1024 23853 22690 22329 22362 22338 22277 22240 22301
> > > > 512 37353 33990 33292 33332 33306 33296 33224 33271
seqrd 1 2 4 8 16 32 64 128
16384 4542 8311 15738 28955 38273 36644 38530 38527
8192 6000 10413 19208 33878 65927 76906 77083 77102
4096 8931 14971 24794 44223 83512 144867 147581 150702
2048 14375 23489 34364 56887 103053 192662 307167 309222
1024 21647 36022 49649 77163 132886 243296 421389 497581
512 31832 61257 79545 108782 176341 303836 517814 584741
Quite a nice improvement for such a small patch. As they say, "every
small factor of 17 helps". ;)
What bothers me a bit is that the single-threaded numbers took such a
noticeable hit...
> Ok, the patch below takes the numbers on my test setup on a 16k IO
> size:
>
> seqrd 1 2 4 8 16
> vanilla 3603 2798 2563 not tested...
> patches 3707 5746 10304 12875 11016
...in particular when your numbers improve even for a single thread.
Wonder what's going on here.
Anyway, feel free to add a Tested-By: or something from me. And maybe
fix the two typos below.
> xfs: don't serialise direct IO reads on page cache checks
>
> From: Dave Chinner <dchinner@redhat.com>
>
> There is no need to grab the i_mutex of the IO lock in exclusive
> mode if we don't need to invalidate the page cache. Taking hese
^
> locks on every direct IO effective serialisaes them as taking the IO
^
> lock in exclusive mode has to wait for all shared holders to drop
> the lock. That only happens when IO is complete, so effective it
> prevents dispatch of concurrent direct IO reads to the same inode.
>
> Fix this by taking the IO lock shared to check the page cache state,
> and only then drop it and take the IO lock exclusively if there is
> work to be done. Hence for the normal direct IO case, no exclusive
> locking will occur.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/linux-2.6/xfs_file.c | 17 ++++++++++++++---
> 1 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
> index 1e641e6..16a4bf0 100644
> --- a/fs/xfs/linux-2.6/xfs_file.c
> +++ b/fs/xfs/linux-2.6/xfs_file.c
> @@ -321,7 +321,19 @@ xfs_file_aio_read(
> if (XFS_FORCED_SHUTDOWN(mp))
> return -EIO;
>
> - if (unlikely(ioflags & IO_ISDIRECT)) {
> + /*
> + * Locking is a bit tricky here. If we take an exclusive lock
> + * for direct IO, we effectively serialise all new concurrent
> + * read IO to this file and block it behind IO that is currently in
> + * progress because IO in progress holds the IO lock shared. We only
> + * need to hold the lock exclusive to blow away the page cache, so
> + * only take lock exclusively if the page cache needs invalidation.
> + * This allows the normal direct IO case of no page cache pages to
> + * proceeed concurrently without serialisation.
> + */
> + xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
> + if ((ioflags & IO_ISDIRECT) && inode->i_mapping->nrpages) {
> + xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
> xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
>
> if (inode->i_mapping->nrpages) {
> @@ -334,8 +346,7 @@ xfs_file_aio_read(
> }
> }
> xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
> - } else
> - xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
> + }
>
> trace_xfs_file_read(ip, size, iocb->ki_pos, ioflags);
>
Jörn
--
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-18 11:40 ` Jörn Engel
@ 2011-07-19 2:41 ` Dave Chinner
2011-07-19 7:36 ` Jörn Engel
0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2011-07-19 2:41 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-fsdevel
On Mon, Jul 18, 2011 at 01:40:36PM +0200, Jörn Engel wrote:
> On Mon, 18 July 2011 20:57:49 +1000, Dave Chinner wrote:
> > On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> > > On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > > > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> >
> > > > > xfs:
> > > > > ====
> > > > > seqrd 1 2 4 8 16 32 64 128
> > > > > 16384 4698 4424 4397 4402 4394 4398 4642 4679
> > > > > 8192 6234 5827 5797 5801 5795 6114 5793 5812
> > > > > 4096 9100 8835 8882 8896 8874 8890 8910 8906
> > > > > 2048 14922 14391 14259 14248 14264 14264 14269 14273
> > > > > 1024 23853 22690 22329 22362 22338 22277 22240 22301
> > > > > 512 37353 33990 33292 33332 33306 33296 33224 33271
>
> Your patch definitely helps. Bottom right number is 584741 now.
> Still slower than ext4 or btrfs, but in the right ballpark. Will
> post the entire block once it has been generated.
The btrfs numbers come from doing different IO. Have a look at all
the sub-filesystem-block-size numbers for btrfs: no matter the
thread count, the number is the same - hardware limits. btrfs is not
doing an IO per read syscall there - I'd say it's falling back to
buffered IO, unlike ext4 and xfs....
.....
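That inference can be sanity-checked against the numbers in the original post (a rough sketch; the ~50us read latency and the btrfs 2k single-thread figure are both taken from earlier in the thread):

```python
# One thread doing synchronous one-IO-per-syscall direct reads is bounded
# by the device read latency (~50us per read, per the hardware description).
read_latency_us = 50
max_sync_iops_per_thread = 10**6 // read_latency_us  # = 20000

btrfs_seqrd_2k_1thread = 297222  # reported btrfs seqrd IOPS, 2048B, 1 thread

# btrfs reports ~15x more IOPS than one thread could get from the device,
# so most read syscalls must be served from readahead/page cache.
print(btrfs_seqrd_2k_1thread / max_sync_iops_per_thread)  # ~14.9x the bound
```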
> seqrd 1 2 4 8 16 32 64 128
> 16384 4542 8311 15738 28955 38273 36644 38530 38527
> 8192 6000 10413 19208 33878 65927 76906 77083 77102
> 4096 8931 14971 24794 44223 83512 144867 147581 150702
> 2048 14375 23489 34364 56887 103053 192662 307167 309222
> 1024 21647 36022 49649 77163 132886 243296 421389 497581
> 512 31832 61257 79545 108782 176341 303836 517814 584741
>
> Quite a nice improvement for such a small patch. As they say, "every
> small factor of 17 helps". ;)
And in general the numbers are within a couple of percent of the
ext4 numbers, which is probably a reflection of the slightly higher
CPU cost of the XFS read path compared to ext4.
> What bothers me a bit is that the single-threaded numbers took such a
> noticeable hit...
Is it reproducable? I did notice quite a bit of run-to-run variation
in the numbers I ran. For single threaded numbers, they appear to be
in the order of +/-100 ops @ 16k block size.
>
> > Ok, the patch below takes the numbers on my test setup on a 16k IO
> > size:
> >
> > seqrd 1 2 4 8 16
> > vanilla 3603 2798 2563 not tested...
> > patches 3707 5746 10304 12875 11016
>
> ...in particular when your numbers improve even for a single thread.
> Wonder what's going on here.
And these were just quoted from a single test run.
> Anyway, feel free to add a Tested-By: or something from me. And maybe
> fix the two typos below.
Will do.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-19 2:41 ` Dave Chinner
@ 2011-07-19 7:36 ` Jörn Engel
2011-07-19 9:23 ` srimugunthan dhandapani
2011-07-19 10:15 ` Dave Chinner
0 siblings, 2 replies; 21+ messages in thread
From: Jörn Engel @ 2011-07-19 7:36 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel
On Tue, 19 July 2011 12:41:38 +1000, Dave Chinner wrote:
> On Mon, Jul 18, 2011 at 01:40:36PM +0200, Jörn Engel wrote:
> > On Mon, 18 July 2011 20:57:49 +1000, Dave Chinner wrote:
> > > On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> > > > On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > > > > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> > >
> > > > > > xfs:
> > > > > > ====
> > > > > > seqrd 1 2 4 8 16 32 64 128
> > > > > > 16384 4698 4424 4397 4402 4394 4398 4642 4679
> > > > > > 8192 6234 5827 5797 5801 5795 6114 5793 5812
> > > > > > 4096 9100 8835 8882 8896 8874 8890 8910 8906
> > > > > > 2048 14922 14391 14259 14248 14264 14264 14269 14273
> > > > > > 1024 23853 22690 22329 22362 22338 22277 22240 22301
> > > > > > 512 37353 33990 33292 33332 33306 33296 33224 33271
>
> > seqrd 1 2 4 8 16 32 64 128
> > 16384 4542 8311 15738 28955 38273 36644 38530 38527
> > 8192 6000 10413 19208 33878 65927 76906 77083 77102
> > 4096 8931 14971 24794 44223 83512 144867 147581 150702
> > 2048 14375 23489 34364 56887 103053 192662 307167 309222
> > 1024 21647 36022 49649 77163 132886 243296 421389 497581
> > 512 31832 61257 79545 108782 176341 303836 517814 584741
>
> > What bothers me a bit is that the single-threaded numbers took such a
> > noticeable hit...
>
> Is it reproducible? I did notice quite a bit of run-to-run variation
> in the numbers I ran. For single threaded numbers, they appear to be
> in the order of +/-100 ops @ 16k block size.
IME the numbers are stable to within about 10%. And given that every
single one of the six measurements is a regression, I would feel
confident betting a beverage without further measurements. The
regression is 3.4%, 3.9%, 1.9%, 3.8%, 10% and 17% respectively, so the
effect appears to be more visible with smaller block sizes as well.
Jörn
--
Schrödinger's cat is <BLINK>not</BLINK> dead.
-- Illiad
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-19 7:36 ` Jörn Engel
@ 2011-07-19 9:23 ` srimugunthan dhandapani
2011-07-21 19:05 ` Jörn Engel
2011-07-19 10:15 ` Dave Chinner
1 sibling, 1 reply; 21+ messages in thread
From: srimugunthan dhandapani @ 2011-07-19 9:23 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-fsdevel
On Tue, Jul 19, 2011 at 1:06 PM, Jörn Engel <joern@logfs.org> wrote:
> On Tue, 19 July 2011 12:41:38 +1000, Dave Chinner wrote:
>> On Mon, Jul 18, 2011 at 01:40:36PM +0200, Jörn Engel wrote:
>> > On Mon, 18 July 2011 20:57:49 +1000, Dave Chinner wrote:
>> > > On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
>> > > > On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
>> > > > > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
>> > >
>> > > > > > xfs:
>> > > > > > ====
>> > > > > > seqrd 1 2 4 8 16 32 64 128
>> > > > > > 16384 4698 4424 4397 4402 4394 4398 4642 4679
>> > > > > > 8192 6234 5827 5797 5801 5795 6114 5793 5812
>> > > > > > 4096 9100 8835 8882 8896 8874 8890 8910 8906
>> > > > > > 2048 14922 14391 14259 14248 14264 14264 14269 14273
>> > > > > > 1024 23853 22690 22329 22362 22338 22277 22240 22301
>> > > > > > 512 37353 33990 33292 33332 33306 33296 33224 33271
>>
>> > seqrd 1 2 4 8 16 32 64 128
>> > 16384 4542 8311 15738 28955 38273 36644 38530 38527
>> > 8192 6000 10413 19208 33878 65927 76906 77083 77102
>> > 4096 8931 14971 24794 44223 83512 144867 147581 150702
>> > 2048 14375 23489 34364 56887 103053 192662 307167 309222
>> > 1024 21647 36022 49649 77163 132886 243296 421389 497581
>> > 512 31832 61257 79545 108782 176341 303836 517814 584741
>>
>> > What bothers me a bit is that the single-threaded numbers took such a
>> > noticeable hit...
>>
>> Is it reproducible? I did notice quite a bit of run-to-run variation
>> in the numbers I ran. For single threaded numbers, they appear to be
>> in the order of +/-100 ops @ 16k block size.
>
> IME the numbers are stable to within about 10%. And given that every
> single one of the six measurements is a regression, I would feel
> confident betting a beverage without further measurements. The
> regression is 3.4%, 3.9%, 1.9%, 3.8%, 10% and 17% respectively, so the
> effect appears to be more visible with smaller block sizes as well.
>
> Jörn
Hi Joern
Is the hardware the "Drais card" that you described in the following link
www.linux-kongress.org/2010/slides/logfs-engel.pdf
Since the driver exposes an mtd device, do you mount the ext4,btrfs
filesystem over any FTL?
Is it possible to have logfs over the PCIe-SSD card?
Pardon me for asking the following in this thread.
I have been trying to mount logfs and I face a seg fault during
unmount. I have tested it on 2.6.34 and 2.6.39.1. I have asked about
the problem here:
http://comments.gmane.org/gmane.linux.file-systems/55008
Two other people have also faced the umount problem in logfs:
1. http://comments.gmane.org/gmane.linux.file-systems/46630
2. http://eeek.borgchat.net/lists/linux-embedded/msg02970.html
My apologies again for asking it here. Since the logfs@logfs.org
mailing list (and the wiki) doesn't work any more, I am asking the
question here. I am thankful for your reply.
Thanks,
mugunthan
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-19 7:36 ` Jörn Engel
2011-07-19 9:23 ` srimugunthan dhandapani
@ 2011-07-19 10:15 ` Dave Chinner
1 sibling, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2011-07-19 10:15 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-fsdevel
On Tue, Jul 19, 2011 at 09:36:33AM +0200, Jörn Engel wrote:
> On Tue, 19 July 2011 12:41:38 +1000, Dave Chinner wrote:
> > On Mon, Jul 18, 2011 at 01:40:36PM +0200, Jörn Engel wrote:
> > > On Mon, 18 July 2011 20:57:49 +1000, Dave Chinner wrote:
> > > > On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> > > > > On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > > > > > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> > > >
> > > > > > > xfs:
> > > > > > > ====
> > > > > > > seqrd 1 2 4 8 16 32 64 128
> > > > > > > 16384 4698 4424 4397 4402 4394 4398 4642 4679
> > > > > > > 8192 6234 5827 5797 5801 5795 6114 5793 5812
> > > > > > > 4096 9100 8835 8882 8896 8874 8890 8910 8906
> > > > > > > 2048 14922 14391 14259 14248 14264 14264 14269 14273
> > > > > > > 1024 23853 22690 22329 22362 22338 22277 22240 22301
> > > > > > > 512 37353 33990 33292 33332 33306 33296 33224 33271
> >
> > > seqrd 1 2 4 8 16 32 64 128
> > > 16384 4542 8311 15738 28955 38273 36644 38530 38527
> > > 8192 6000 10413 19208 33878 65927 76906 77083 77102
> > > 4096 8931 14971 24794 44223 83512 144867 147581 150702
> > > 2048 14375 23489 34364 56887 103053 192662 307167 309222
> > > 1024 21647 36022 49649 77163 132886 243296 421389 497581
> > > 512 31832 61257 79545 108782 176341 303836 517814 584741
> >
> > > What bothers me a bit is that the single-threaded numbers took such a
> > > noticeable hit...
> >
> > Is it reproducible? I did notice quite a bit of run-to-run variation
> > in the numbers I ran. For single threaded numbers, they appear to be
> > in the order of +/-100 ops @ 16k block size.
>
> IME the numbers are stable to within about 10%. And given that every
> single one of the six measurements is a regression, I would feel
> confident betting a beverage without further measurements. The
> regression is 3.4%, 3.9%, 1.9%, 3.8%, 10% and 17% respectively, so the
> effect appears to be more visible with smaller block sizes as well.
The only thing I can think of, then, is that taking the lock shared
is more expensive than taking it exclusive. Otherwise there is little
change to the code path....
/me shrugs and cares not all that much right now
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-17 16:05 Filesystem benchmarks on reasonably fast hardware Jörn Engel
2011-07-17 23:32 ` Dave Chinner
2011-07-18 12:07 ` Ted Ts'o
@ 2011-07-19 13:19 ` Dave Chinner
2011-07-21 10:42 ` Jörn Engel
2 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2011-07-19 13:19 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-fsdevel
On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> xfs:
> ====
.....
> seqwr 1 2 4 8 16 32 64 128
> 16384 39956 39695 39971 39913 37042 37538 36591 32179
> 8192 67934 66073 30963 29038 29852 25210 23983 28272
> 4096 89250 81417 28671 18685 12917 14870 22643 22237
> 2048 140272 120588 140665 140012 137516 139183 131330 129684
> 1024 217473 147899 210350 218526 219867 220120 219758 215166
> 512 328260 181197 211131 263533 294009 298203 301698 298013
OK, I can explain the pattern here where throughput drops off at 2-4
threads. It's not as simple as the seqrd case, but it's related to
the fact that this workload is an append write workload. See the
patch description below for why that matters.
As it is, the numbers I get for 16k seqwr on my hardware are as
follows:
seqwr 1 2 4 8 16
vanilla 3072 2734 2506 not tested...
patched 2984 4156 4922 5175 5120
Looks like my hardware is topping out at ~5-6kiops no matter the
block size here. Which, no matter how you look at it, is a
significant improvement. ;)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
xfs: don't serialise adjacent concurrent direct IO appending writes
For append write workloads, extending the file requires a certain
amount of exclusive locking to be done up front, to ensure that we
have zeroed any allocated regions between the old EOF and the start
of the new IO.
For single threads, this typically isn't a problem, and for large
IOs we don't serialise enough for it to be a problem for two
threads on really fast block devices. However for smaller IO and
larger thread counts we have a problem.
Take 4 concurrent sequential, single block sized and aligned IOs.
After the first IO is submitted but before it completes, we end up
with this state:
IO 1 IO 2 IO 3 IO 4
+-------+-------+-------+-------+
^ ^
| |
| |
| |
| \- ip->i_new_size
\- ip->i_size
And the IO is done without exclusive locking because offset <=
ip->i_size. When we submit IO 2, we see offset > ip->i_size, and
grab the IO lock exclusive, because there is a chance we need to do
EOF zeroing. However, there is already an IO in progress that avoids
the need for EOF zeroing because offset <= ip->i_new_size. Hence we
could avoid holding the IO lock exclusive for this. Hence after
submission of the second IO, we'd end up in this state:
IO 1 IO 2 IO 3 IO 4
+-------+-------+-------+-------+
^ ^
| |
| |
| |
| \- ip->i_new_size
\- ip->i_size
There is no need to grab the i_mutex or the IO lock in exclusive
mode if we don't need to invalidate the page cache. Taking these
locks on every direct IO effectively serialises them, as taking the
IO lock in exclusive mode has to wait for all shared holders to drop
the lock. That only happens when IO is complete, so effectively it
prevents dispatch of concurrent direct IO writes to the same inode.
And so you can see that for the third concurrent IO, we'd avoid
exclusive locking for the same reason we avoided the exclusive lock
for the second IO.
Fixing this is a bit more complex than that, because we need to hold
a write-submission-local value of ip->i_new_size so that clearing
the value is only done if no other thread has updated it before our
IO completes...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/linux-2.6/xfs_aops.c | 7 ++++
fs/xfs/linux-2.6/xfs_file.c | 69 ++++++++++++++++++++++++++++++++++---------
2 files changed, 62 insertions(+), 14 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 63e971e..dda9a9e 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -176,6 +176,13 @@ xfs_setfilesize(
if (unlikely(ioend->io_error))
return 0;
+ /*
+ * If the IO is clearly not beyond the on-disk inode size,
+ * return before we take locks.
+ */
+ if (ioend->io_offset + ioend->io_size <= ip->i_d.di_size)
+ return 0;
+
if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
return EAGAIN;
diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index 16a4bf0..5b6703a 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -422,11 +422,13 @@ xfs_aio_write_isize_update(
*/
STATIC void
xfs_aio_write_newsize_update(
- struct xfs_inode *ip)
+ struct xfs_inode *ip,
+ xfs_fsize_t new_size)
{
- if (ip->i_new_size) {
+ if (new_size == ip->i_new_size) {
xfs_rw_ilock(ip, XFS_ILOCK_EXCL);
- ip->i_new_size = 0;
+ if (new_size == ip->i_new_size)
+ ip->i_new_size = 0;
if (ip->i_d.di_size > ip->i_size)
ip->i_d.di_size = ip->i_size;
xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
@@ -478,7 +480,7 @@ xfs_file_splice_write(
count, flags);
xfs_aio_write_isize_update(inode, ppos, ret);
- xfs_aio_write_newsize_update(ip);
+ xfs_aio_write_newsize_update(ip, new_size);
xfs_iunlock(ip, XFS_IOLOCK_EXCL);
return ret;
}
@@ -675,6 +677,7 @@ xfs_file_aio_write_checks(
struct file *file,
loff_t *pos,
size_t *count,
+ xfs_fsize_t *new_sizep,
int *iolock)
{
struct inode *inode = file->f_mapping->host;
@@ -682,6 +685,8 @@ xfs_file_aio_write_checks(
xfs_fsize_t new_size;
int error = 0;
+restart:
+ *new_sizep = 0;
error = generic_write_checks(file, pos, count, S_ISBLK(inode->i_mode));
if (error) {
xfs_rw_iunlock(ip, XFS_ILOCK_EXCL | *iolock);
@@ -689,9 +694,18 @@ xfs_file_aio_write_checks(
return error;
}
+ /*
+ * if we are writing beyond the current EOF, only update the
+ * ip->i_new_size if it is larger than any other concurrent write beyond
+ * EOF. Regardless of whether we update ip->i_new_size, return the
+ * updated new_size to the caller.
+ */
new_size = *pos + *count;
- if (new_size > ip->i_size)
- ip->i_new_size = new_size;
+ if (new_size > ip->i_size) {
+ if (new_size > ip->i_new_size)
+ ip->i_new_size = new_size;
+ *new_sizep = new_size;
+ }
if (likely(!(file->f_mode & FMODE_NOCMTIME)))
file_update_time(file);
@@ -699,10 +713,22 @@ xfs_file_aio_write_checks(
/*
* If the offset is beyond the size of the file, we need to zero any
* blocks that fall between the existing EOF and the start of this
- * write.
+ * write. Don't issue zeroing if this IO is adjacent to an IO already in
+ * flight. If we are currently holding the iolock shared, we need to
+ * update it to exclusive which involves dropping all locks and
+ * relocking to maintain correct locking order. If we do this, restart
+ * the function to ensure all checks and values are still valid.
*/
- if (*pos > ip->i_size)
+ if ((ip->i_new_size && *pos > ip->i_new_size) ||
+ (!ip->i_new_size && *pos > ip->i_size)) {
+ if (*iolock == XFS_IOLOCK_SHARED) {
+ xfs_rw_iunlock(ip, XFS_ILOCK_EXCL | *iolock);
+ *iolock = XFS_IOLOCK_EXCL;
+ xfs_rw_ilock(ip, XFS_ILOCK_EXCL | *iolock);
+ goto restart;
+ }
error = -xfs_zero_eof(ip, *pos, ip->i_size);
+ }
xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
if (error)
@@ -749,6 +775,7 @@ xfs_file_dio_aio_write(
unsigned long nr_segs,
loff_t pos,
size_t ocount,
+ xfs_fsize_t *new_size,
int *iolock)
{
struct file *file = iocb->ki_filp;
@@ -769,13 +796,25 @@ xfs_file_dio_aio_write(
if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
unaligned_io = 1;
- if (unaligned_io || mapping->nrpages || pos > ip->i_size)
+ /*
+ * Tricky locking alert: if we are doing multiple concurrent sequential
+ * writes (e.g. via aio), we don't need to do EOF zeroing if the current
+ * IO is adjacent to an in-flight IO. That means for such IO we can
+ * avoid taking the IOLOCK exclusively. Hence we avoid checking for
+ * writes beyond EOF at this point when deciding what lock to take.
+ * We will take the IOLOCK exclusive later if necessary.
+ *
+ * This, however, means that we need a local copy of the ip->i_new_size
+ * value from this IO if we change it so that we can determine if we can
+ * clear the value from the inode when this IO completes.
+ */
+ if (unaligned_io || mapping->nrpages)
*iolock = XFS_IOLOCK_EXCL;
else
*iolock = XFS_IOLOCK_SHARED;
xfs_rw_ilock(ip, XFS_ILOCK_EXCL | *iolock);
- ret = xfs_file_aio_write_checks(file, &pos, &count, iolock);
+ ret = xfs_file_aio_write_checks(file, &pos, &count, new_size, iolock);
if (ret)
return ret;
@@ -814,6 +853,7 @@ xfs_file_buffered_aio_write(
unsigned long nr_segs,
loff_t pos,
size_t ocount,
+ xfs_fsize_t *new_size,
int *iolock)
{
struct file *file = iocb->ki_filp;
@@ -827,7 +867,7 @@ xfs_file_buffered_aio_write(
*iolock = XFS_IOLOCK_EXCL;
xfs_rw_ilock(ip, XFS_ILOCK_EXCL | *iolock);
- ret = xfs_file_aio_write_checks(file, &pos, &count, iolock);
+ ret = xfs_file_aio_write_checks(file, &pos, &count, new_size, iolock);
if (ret)
return ret;
@@ -867,6 +907,7 @@ xfs_file_aio_write(
ssize_t ret;
int iolock;
size_t ocount = 0;
+ xfs_fsize_t new_size = 0;
XFS_STATS_INC(xs_write_calls);
@@ -886,10 +927,10 @@ xfs_file_aio_write(
if (unlikely(file->f_flags & O_DIRECT))
ret = xfs_file_dio_aio_write(iocb, iovp, nr_segs, pos,
- ocount, &iolock);
+ ocount, &new_size, &iolock);
else
ret = xfs_file_buffered_aio_write(iocb, iovp, nr_segs, pos,
- ocount, &iolock);
+ ocount, &new_size, &iolock);
xfs_aio_write_isize_update(inode, &iocb->ki_pos, ret);
@@ -914,7 +955,7 @@ xfs_file_aio_write(
}
out_unlock:
- xfs_aio_write_newsize_update(ip);
+ xfs_aio_write_newsize_update(ip, new_size);
xfs_rw_iunlock(ip, iolock);
return ret;
}
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-19 13:19 ` Dave Chinner
@ 2011-07-21 10:42 ` Jörn Engel
2011-07-22 18:51 ` Jörn Engel
0 siblings, 1 reply; 21+ messages in thread
From: Jörn Engel @ 2011-07-21 10:42 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel
On Tue, 19 July 2011 23:19:58 +1000, Dave Chinner wrote:
> On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> > xfs:
> > ====
> .....
> > seqwr 1 2 4 8 16 32 64 128
> > 16384 39956 39695 39971 39913 37042 37538 36591 32179
> > 8192 67934 66073 30963 29038 29852 25210 23983 28272
> > 4096 89250 81417 28671 18685 12917 14870 22643 22237
> > 2048 140272 120588 140665 140012 137516 139183 131330 129684
> > 1024 217473 147899 210350 218526 219867 220120 219758 215166
> > 512 328260 181197 211131 263533 294009 298203 301698 298013
>
> OK, I can explain the pattern here where throughput drops off at 2-4
> threads. It's not as simple as the seqrd case, but it's related to
> the fact that this workload is an append write workload. See the
> patch description below for why that matters.
>
> As it is, the numbers I get for 16k seqwr on my hardware are as
> follows:
>
> seqwr 1 2 4 8 16
> vanilla 3072 2734 2506 not tested...
> patched 2984 4156 4922 5175 5120
>
> Looks like my hardware is topping out at ~5-6kiops no matter the
> block size here. Which, no matter how you look at it, is a
> significant improvement. ;)
My numbers include some regressions, although the improvements clearly
dominate. Below is a diff (or rather, a div) between the new kernel
with both your patches applied and vanilla: >1 means improvement, <1
means regression.
seqrd 1 2 4 8 16 32 64 128
16384 1.037 1.975 3.726 6.643 8.901 8.902 8.431 8.365
8192 1.015 1.871 3.459 6.424 11.457 12.829 13.542 13.490
4096 1.009 1.790 2.942 5.179 9.634 16.667 17.652 17.666
2048 1.005 1.709 2.525 4.196 7.479 14.022 22.032 22.100
1024 1.017 1.624 2.328 3.587 6.112 11.365 20.311 21.315
512 1.012 1.829 2.374 3.365 5.352 9.459 16.809 18.771
rndrd 1 2 4 8 16 32 64 128
16384 1.042 1.037 1.036 1.043 1.051 1.011 1.002 1.001
8192 1.020 1.020 1.028 1.040 1.057 1.064 1.002 1.001
4096 1.011 1.007 1.021 1.036 1.059 1.086 1.021 1.001
2048 1.002 1.010 1.018 1.025 1.057 1.100 1.098 1.003
1024 1.001 1.002 1.023 1.007 1.072 1.112 1.162 1.102
512 0.998 1.010 1.004 1.035 1.088 1.121 1.156 1.127
seqwr 1 2 4 8 16 32 64 128
16384 0.942 0.949 0.942 0.945 1.017 1.004 1.030 1.172
8192 1.144 1.177 2.517 2.687 2.611 3.091 3.246 2.741
4096 1.389 1.506 4.228 6.443 9.313 8.064 5.276 5.394
2048 1.139 1.278 1.080 1.076 1.094 1.087 1.142 1.148
1024 0.852 1.190 0.806 0.783 0.776 0.774 0.769 0.774
512 0.709 1.273 1.055 0.847 0.758 0.744 0.738 0.746
rndwr 1 2 4 8 16 32 64 128
16384 1.013 1.003 1.002 1.005 1.007 1.006 1.003 1.002
8192 1.023 1.005 1.007 1.006 1.006 1.004 1.004 1.006
4096 1.020 1.007 1.007 1.007 1.007 1.007 1.008 1.007
2048 0.901 1.017 1.007 1.008 1.008 1.009 1.008 1.007
1024 0.848 0.949 1.003 0.990 1.001 1.006 1.006 1.005
512 0.821 0.833 0.948 0.956 0.935 0.929 0.921 0.914
Raw results:
seqrd 1 2 4 8 16 32 64 128
16384 4873 8738 16382 29241 39111 39152 39137 39140
8192 6326 10900 20054 37263 66391 78437 78449 78404
4096 9181 15816 26130 46073 85492 148172 157276 157329
2048 14995 24588 36009 59790 106685 200012 314373 315440
1024 24248 36841 51972 80207 136529 253175 451709 475353
512 37813 62164 79048 112175 178246 314959 558458 624534
rndrd 1 2 4 8 16 32 64 128
16384 4778 8554 14724 23507 33666 39065 39109 39104
8192 6152 11409 20862 35814 56123 75776 78370 78380
4096 8335 15643 29662 53953 91867 136643 157314 157325
2048 11973 22885 43474 81545 148087 239997 314198 315680
1024 16547 31345 61123 113283 222737 387234 562457 632767
512 20590 40134 73333 135621 294448 543117 793329 818861
seqwr 1 2 4 8 16 32 64 128
16384 37629 37651 37667 37711 37658 37674 37687 37727
8192 77691 77747 77948 78017 77940 77931 77847 77488
4096 123997 122607 121219 120394 120301 119908 119457 119939
2048 159816 154063 151987 150608 150449 151298 150016 148852
1024 185215 175977 169562 171078 170649 170420 169076 166614
512 232890 230669 222830 223140 222877 221812 222588 222369
rndwr 1 2 4 8 16 32 64 128
16384 38944 38256 38227 38312 38438 38432 38331 38313
8192 79773 77378 77453 77425 77473 77500 77458 77535
4096 163925 157167 158258 158192 158244 158281 158229 158252
2048 293295 322480 320206 321022 321375 321926 322298 322558
1024 368010 616516 652359 645514 654715 658132 659513 659125
512 411236 730015 1223437 1164632 1163705 1178235 1184450 1186594
Jörn
--
Ninety percent of everything is crap.
-- Sturgeon's Law
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-19 9:23 ` srimugunthan dhandapani
@ 2011-07-21 19:05 ` Jörn Engel
0 siblings, 0 replies; 21+ messages in thread
From: Jörn Engel @ 2011-07-21 19:05 UTC (permalink / raw)
To: srimugunthan dhandapani; +Cc: linux-fsdevel
On Tue, 19 July 2011 14:53:08 +0530, srimugunthan dhandapani wrote:
>
> Is the hardware the "Drais card" that you described in the following link
> www.linux-kongress.org/2010/slides/logfs-engel.pdf
Yes.
> Since the driver exposes an mtd device, do you mount the ext4,btrfs
> filesystem over any FTL?
That was last year. In the meantime I've added an FTL to the driver,
so the card behaves like a regular SSD. Well, mostly.
> Is it possible to have logfs over the PCIe-SSD card?
YeaaaNo! Not anymore. Could be lack of error correction in the
current driver or could be bitrot. Logfs over loopback seems to work
just fine, so if it is bitrot, it is limited to the mtd interface.
> Pardon me for asking the following in this thread.
> I have been trying to mount logfs and i face seg fault during unmount
> . I have tested it in 2.6.34 and 2.39.1. I have asked about the
> problem here.
> http://comments.gmane.org/gmane.linux.file-systems/55008
>
> Two other people have also faced umount problem in logfs
>
> 1. http://comments.gmane.org/gmane.linux.file-systems/46630
> 2. http://eeek.borgchat.net/lists/linux-embedded/msg02970.html
>
> My apologies again for asking it here. Since the logfs@logfs.org
> mailing list(and the wiki) doesnt work any more , i am asking the
> question here. I am thankful for your reply.
Yes, ever since that machine died I have basically been the
non-maintainer of logfs. In a different century I would have been
hanged, drawn and quartered for it. Give me some time to test the mtd
side and see what's up.
Jörn
--
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface.
-- Doug McIlroy
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-21 10:42 ` Jörn Engel
@ 2011-07-22 18:51 ` Jörn Engel
0 siblings, 0 replies; 21+ messages in thread
From: Jörn Engel @ 2011-07-22 18:51 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel
On Thu, 21 July 2011 12:42:46 +0200, Jörn Engel wrote:
> On Tue, 19 July 2011 23:19:58 +1000, Dave Chinner wrote:
> > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
>
> [ Crap ]
I had tested ext4 with two xfs patches. Try these numbers instead.
Both patches have my endorsement. Excellent work!
seqrd 1 2 4 8 16 32 64 128
16384 1.000 1.880 3.456 6.297 8.727 8.703 8.271 8.208
8192 1.001 1.811 3.304 6.153 10.061 12.567 13.248 12.077
4096 1.001 1.752 2.832 4.968 9.199 15.937 17.228 17.139
2048 1.001 1.689 2.459 4.053 7.152 13.241 21.565 21.694
1024 1.011 1.619 2.296 3.521 5.935 10.849 19.649 27.848
512 1.008 1.825 2.371 3.310 5.230 9.146 16.591 27.234
rndrd 1 2 4 8 16 32 64 128
16384 1.003 1.005 1.009 1.021 1.032 1.009 1.001 1.001
8192 1.002 1.004 1.013 1.024 1.041 1.051 1.001 1.001
4096 1.003 1.004 1.013 1.027 1.049 1.071 1.020 1.000
2048 1.004 1.010 1.019 1.011 1.052 1.091 1.091 1.002
1024 1.003 1.009 1.028 1.027 1.068 1.109 1.155 1.099
512 1.002 1.014 1.016 1.044 1.083 1.125 1.196 1.236
seqwr 1 2 4 8 16 32 64 128
16384 1.003 1.001 0.981 0.953 0.995 0.947 1.057 1.203
8192 0.999 1.048 2.120 2.060 1.799 1.991 2.093 1.998
4096 0.991 1.074 2.901 3.878 5.218 4.030 2.358 2.601
2048 1.005 1.273 1.058 1.077 1.112 1.123 1.137 1.161
1024 0.999 1.605 1.147 1.059 1.059 1.047 1.064 1.069
512 0.947 1.978 1.618 1.317 1.181 1.156 1.149 1.134
rndwr 1 2 4 8 16 32 64 128
16384 1.000 0.999 1.000 1.001 1.000 1.000 1.001 0.999
8192 0.999 1.000 1.000 1.001 1.000 1.001 1.001 1.003
4096 0.997 0.998 1.000 1.000 1.001 1.000 1.001 1.000
2048 1.002 1.001 1.001 1.003 1.001 1.002 1.000 1.000
1024 0.998 1.001 1.000 1.001 1.000 1.001 0.999 1.001
512 1.044 0.999 1.003 1.001 1.001 1.001 1.002 0.998
seqrd 1 2 4 8 16 32 64 128
16384 4700 8316 15197 27721 38348 38277 38394 38406
8192 6241 10551 19156 35692 58304 76835 76743 70192
4096 9110 15477 25155 44196 81632 141681 153499 152642
2048 14942 24309 35063 57754 102009 188865 307705 309641
1024 24104 36724 51278 78737 132577 241681 437003 621032
512 37646 62022 78943 110334 174203 304532 551212 906087
rndrd 1 2 4 8 16 32 64 128
16384 4598 8288 14352 22999 33051 38977 39072 39086
8192 6042 11233 20566 35279 55300 74863 78278 78359
4096 8268 15604 29428 53514 91016 134799 157045 157144
2048 11997 22877 43550 80430 147372 237967 312170 315369
1024 16578 31577 61419 115548 221986 386119 558797 631441
512 20668 40293 74185 136774 293068 545050 820771 897897
seqwr 1 2 4 8 16 32 64 128
16384 40074 39718 39198 38027 36846 35562 38659 38726
8192 67896 69240 65628 59807 53713 50181 50208 56486
4096 88439 87416 83167 72468 67401 59932 53383 57845
2048 141003 153543 148813 150740 152966 156238 149370 150576
1024 217311 237402 241186 231341 232902 230429 233877 230095
512 310980 358427 341578 347183 347281 344722 346779 337970
rndwr 1 2 4 8 16 32 64 128
16384 38436 38112 38154 38161 38174 38208 38250 38197
8192 77890 76972 76938 76993 77031 77255 77274 77301
4096 160246 155612 157142 157090 157213 157081 157193 157160
2048 326008 317372 318089 319273 318994 319596 319773 320299
1024 433107 650226 649868 652195 653764 654760 655299 656246
512 523091 875267 1294281 1218935 1245993 1269267 1287429 1296046
Jörn
--
Fools ignore complexity. Pragmatists suffer it.
Some can avoid it. Geniuses remove it.
-- Perlis's Programming Proverb #58, SIGPLAN Notices, Sept. 1982
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-18 12:42 ` Jörn Engel
@ 2011-07-25 15:18 ` Ted Ts'o
2011-07-25 18:20 ` Jörn Engel
0 siblings, 1 reply; 21+ messages in thread
From: Ted Ts'o @ 2011-07-25 15:18 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-fsdevel, linux-ext4
On Mon, Jul 18, 2011 at 02:42:29PM +0200, Jörn Engel wrote:
> On Mon, 18 July 2011 08:07:51 -0400, Ted Ts'o wrote:
> >
> > Can you send me your script and the lockstat for ext4?
>
> Attached. The first script generates a bunch of files, the second
> condenses them into the tabular form. Will need some massaging to
> work on anything other than my particular setup, sorry.
>
> > (Please cc the linux-ext4@vger.kernel.org list if you don't mind.
> > Thanks!!)
>
> Sure. Lockstat will come later today. The machine is currently busy
> regenerating xfs seqrd numbers.
Hi Jörn,
Did you have a chance to do an ext4 lockstat run?
Many thanks!!
- Ted
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-25 15:18 ` Ted Ts'o
@ 2011-07-25 18:20 ` Jörn Engel
2011-07-25 21:18 ` Ted Ts'o
2011-07-26 14:57 ` Ted Ts'o
0 siblings, 2 replies; 21+ messages in thread
From: Jörn Engel @ 2011-07-25 18:20 UTC (permalink / raw)
To: Ted Ts'o; +Cc: linux-fsdevel, linux-ext4
On Mon, 25 July 2011 11:18:25 -0400, Ted Ts'o wrote:
>
> Did you have a chance to do an ext4 lockstat run?
Yes, I did. But your mails keep bouncing, so you have to look at the
list to see it (or this mail). Yes, I lack a proper reverse DNS
record, as the IP belongs to my provider, not me. Most people don't
care, some bounce, some silently ignore my mail. The joys of spam
filtering.
Jörn
--
The rabbit runs faster than the fox, because the rabbit is running for
his life while the fox is only running for his dinner.
-- Aesop
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-25 18:20 ` Jörn Engel
@ 2011-07-25 21:18 ` Ted Ts'o
2011-07-26 14:57 ` Ted Ts'o
1 sibling, 0 replies; 21+ messages in thread
From: Ted Ts'o @ 2011-07-25 21:18 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-fsdevel, linux-ext4
On Mon, Jul 25, 2011 at 08:20:37PM +0200, Jörn Engel wrote:
> On Mon, 25 July 2011 11:18:25 -0400, Ted Ts'o wrote:
> >
> > Did you have a chance to do an ext4 lockstat run?
>
> Yes, I did. But your mails keep bouncing, so you have to look at the
> list to see it (or this mail). Yes, I lack a proper reverse DNS
> record, as the IP belongs to my provider, not me. Most people don't
> care, some bounce, some silently ignore my mail. The joys of spam
> filtering.
I didn't see the ext4 lockstat on the list. Can you resend it to
tytso@google.com or theodore.tso@gmail.com? MIT is using an
outsourced SPAM provider (Brightmail anti-spam), and I can't do
anything about that, unfortunately. From what I can tell the
Brightmail doesn't drop all e-mails from non-resolving IP's, but if
it's in a "bad neighborhood" (i.e., your neighbors are all spammers,
or belong to Windows users where 80% of the machines are spambots),
Brightmail is probably going to flag your mail as spam. :-(
Thanks!
- Ted
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-25 18:20 ` Jörn Engel
2011-07-25 21:18 ` Ted Ts'o
@ 2011-07-26 14:57 ` Ted Ts'o
2011-07-27 3:39 ` Yongqiang Yang
1 sibling, 1 reply; 21+ messages in thread
From: Ted Ts'o @ 2011-07-26 14:57 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-fsdevel, linux-ext4
On Mon, Jul 25, 2011 at 08:20:37PM +0200, Jörn Engel wrote:
> On Mon, 25 July 2011 11:18:25 -0400, Ted Ts'o wrote:
> >
> > Did you have a chance to do an ext4 lockstat run?
Hi Jörn,
Thanks for forwarding it to me. It's the same problem as in XFS, the
excessive coverage of the i_mutex lock. In ext4's case, it's in the
generic generic_file_aio_write() machinery where we need to do the
lock busting. (XFS apparently doesn't use the generic routines, so
the fix that Dave did won't help ext3 and ext4.)
I don't have the time to look at it now, but I'll put it on my todo
list; or maybe someone with a bit more time can look into how we might
be able to use a similar approach in the generic file system code.
- Ted
* Re: Filesystem benchmarks on reasonably fast hardware
2011-07-26 14:57 ` Ted Ts'o
@ 2011-07-27 3:39 ` Yongqiang Yang
0 siblings, 0 replies; 21+ messages in thread
From: Yongqiang Yang @ 2011-07-27 3:39 UTC (permalink / raw)
To: Jörn Engel, Ted Ts'o; +Cc: linux-fsdevel, linux-ext4
Hi Jörn and Ted,
Could anyone send out the ext4 lockstat on the list?
Thank you,
Yongqiang.
On Tue, Jul 26, 2011 at 10:57 PM, Ted Ts'o <tytso@mit.edu> wrote:
> On Mon, Jul 25, 2011 at 08:20:37PM +0200, Jörn Engel wrote:
>> On Mon, 25 July 2011 11:18:25 -0400, Ted Ts'o wrote:
>> >
>> > Did you have a chance to do an ext4 lockstat run?
>
> Hi Jörn,
>
> Thanks for forwarding it to me. It's the same problem as in XFS, the
> excessive coverage of the i_mutex lock. In ext4's case, it's in the
> generic generic_file_aio_write() machinery where we need to do the
> lock busting. (XFS apparently doesn't use the generic routines, so
> the fix that Dave did won't help ext3 and ext4.)
>
> I don't have the time to look at it now, but I'll put it on my todo
> list; or maybe someone with a bit more time can look into how we might
> be able to use a similar approach in the generic file system code.
>
> - Ted
--
Best Wishes
Yongqiang Yang
Thread overview: 21+ messages
2011-07-17 16:05 Filesystem benchmarks on reasonably fast hardware Jörn Engel
2011-07-17 23:32 ` Dave Chinner
[not found] ` <20110718075339.GB1437@logfs.org>
2011-07-18 10:57 ` Dave Chinner
2011-07-18 11:40 ` Jörn Engel
2011-07-19 2:41 ` Dave Chinner
2011-07-19 7:36 ` Jörn Engel
2011-07-19 9:23 ` srimugunthan dhandapani
2011-07-21 19:05 ` Jörn Engel
2011-07-19 10:15 ` Dave Chinner
2011-07-18 14:34 ` Jörn Engel
[not found] ` <20110718103956.GE1437@logfs.org>
2011-07-18 11:10 ` Dave Chinner
2011-07-18 12:07 ` Ted Ts'o
2011-07-18 12:42 ` Jörn Engel
2011-07-25 15:18 ` Ted Ts'o
2011-07-25 18:20 ` Jörn Engel
2011-07-25 21:18 ` Ted Ts'o
2011-07-26 14:57 ` Ted Ts'o
2011-07-27 3:39 ` Yongqiang Yang
2011-07-19 13:19 ` Dave Chinner
2011-07-21 10:42 ` Jörn Engel
2011-07-22 18:51 ` Jörn Engel