* Filesystem benchmarks on reasonably fast hardware
@ 2011-07-17 16:05 Jörn Engel
  2011-07-17 23:32 ` Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Jörn Engel @ 2011-07-17 16:05 UTC (permalink / raw)
  To: linux-fsdevel

Hello everyone!

Recently I have had the pleasure of working with some nice hardware
and the displeasure of seeing it fail commercially.  However, when
trying to optimize performance I noticed that in some cases the
bottlenecks were not in the hardware or my driver, but rather in the
filesystem on top of it.  So maybe all this may still be useful in
improving said filesystem.

Hardware is basically a fast SSD.  Performance tops out at about
650MB/s and is fairly insensitive to random access behaviour.  Latency
is about 50us for 512B reads and near 0 for writes, through the usual
cheating.

Numbers below were created with sysbench, using directIO.  Each block
is a matrix with results for blocksizes from 512B to 16384B and thread
count from 1 to 128.  Four blocks for reads and writes, both
sequential and random.
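
For illustration, each benchmark thread in such a run essentially boils
down to something like the following C sketch of the random-read case
(this is an approximation, not the actual sysbench invocation; the real
script is attached later in this thread, and the file name, file size
and request count here are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const size_t bs = 4096;                  /* block size under test */
	const off_t fsize = 1024 * 1024 * 1024;  /* 1 GiB test file (assumed) */
	void *buf;
	long i;
	int fd;

	fd = open("testfile", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT wants block-aligned buffers, offsets and sizes */
	if (posix_memalign(&buf, 4096, bs))
		return 1;

	for (i = 0; i < 100000; i++) {
		off_t off = (random() % (fsize / bs)) * bs;

		if (pread(fd, buf, bs, off) != (ssize_t)bs) {
			perror("pread");
			break;
		}
	}
	free(buf);
	close(fd);
	return 0;
}

The sequential cases simply advance the offset by one block per request,
and the write cases use pwrite() on a file opened O_WRONLY | O_DIRECT.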

Ext4:
=====
seqrd	1	2	4	8	16	32	64	128
16384	4867	8717	16367	29249	39131	39140	39135	39123	
8192	6324	10889	19980	37239	66346	78444	78429	78409	
4096	9158	15810	26072	45999	85371	148061	157222	157294	
2048	15019	24555	35934	59698	106541	198986	313969	315566	
1024	24271	36914	51845	80230	136313	252832	454153	484120	
512	37803	62144	78952	111489	177844	314896	559295	615744	

rndrd	1	2	4	8	16	32	64	128
16384	4770	8539	14715	23465	33630	39073	39101	39103	
8192	6138	11398	20873	35785	56068	75743	78364	78374	
4096	8338	15657	29648	53927	91854	136595	157279	157349	
2048	11985	22894	43495	81211	148029	239962	314183	315695	
1024	16529	31306	61307	114721	222700	387439	561810	632719	
512	20580	40160	73642	135711	294583	542828	795607	821025	

seqwr	1	2	4	8	16	32	64	128
16384	37588	37600	37730	37680	37631	37664	37670	37662	
8192	77621	77737	77947	77967	77875	77939	77833	77574	
4096	124083	123171	121159	120947	120202	120315	119917	120236	
2048	158540	153993	151128	150663	150686	151159	150358	147827	
1024	183300	176091	170527	170919	169608	169900	169662	168622	
512	229167	231672	221629	220416	223490	217877	222390	219718	

rndwr	1	2	4	8	16	32	64	128
16384	38932	38290	38200	38306	38421	38404	38329	38326	
8192	79790	77297	77464	77447	77420	77460	77495	77545	
4096	163985	157626	158232	158212	158102	158169	158273	158236	
2048	272261	322637	320032	320932	321597	322008	322242	322699	
1024	339647	609192	652655	644903	654604	658292	659119	659667	
512	403366	718426	1227643	1149850	1155541	1157633	1173567	1180710	

Sequential writes are significantly worse than random writes.  If
someone is interested, I can see which lock is causing all this.
Sequential reads below 2k are also worse, although one might wonder
whether direct IO on 1k chunks makes sense at all.  Random reads in
the last column scale very nicely with block size down to 1k, but hit
some problem at 512B.  The machine could be cpu-bound at this point.


Btrfs:
======
seqrd	1	2	4	8	16	32	64	128
16384	3270	6582	12919	24866	36424	39682	39726	39721	
8192	4394	8348	16483	32165	54221	79256	79396	79415	
4096	6337	12024	21696	40569	74924	131763	158292	158763	
2048	297222	298299	294727	294740	296496	298517	300118	300740	
1024	583891	595083	584272	580965	584030	589115	599634	598054	
512	1103026	1175523	1134172	1133606	1123684	1123978	1156758	1130354	

rndrd	1	2	4	8	16	32	64	128
16384	3252	6621	12437	20354	30896	39365	39115	39746	
8192	4273	8749	17871	32135	51812	72715	79443	79456	
4096	5842	11900	24824	48072	84485	128721	158631	158812	
2048	7177	12540	20244	27543	32386	34839	35728	35916	
1024	7178	12577	20341	27473	32656	34763	36056	35960	
512	7176	12554	20289	27603	32504	34781	35983	35919	

seqwr	1	2	4	8	16	32	64	128
16384	13357	12838	12604	12596	12588	12641	12716	12814	
8192	21426	20471	20090	20097	20287	20236	20445	20528	
4096	30740	29187	28528	28525	28576	28580	28883	29258	
2048	2949	3214	3360	3431	3440	3498	3396	3498	
1024	2167	2205	2412	2376	2473	2221	2410	2420	
512	1888	1876	1926	1981	1935	1938	1957	1976	

rndwr	1	2	4	8	16	32	64	128
16384	10985	19312	27430	27813	28157	28528	28308	28234	
8192	16505	29420	35329	34925	36020	34976	35897	35174	
4096	21894	31724	34106	34799	36119	36608	37571	36274	
2048	3637	8031	15225	22599	30882	31966	32567	32427	
1024	3704	8121	15219	23670	31784	33156	31469	33547	
512	3604	7988	15206	23742	32007	31933	32523	33667	

Sequential writes below 4k perform drastically worse.  Quite
unexpected.  Write performance across the board is horrible when
compared to ext4.  Sequential reads are much better, in particular for
<4k cases.  I would assume some sort of readahead is happening.
Random reads <4k again drop off significantly.


xfs:
====
seqrd	1	2	4	8	16	32	64	128
16384	4698	4424	4397	4402	4394	4398	4642	4679	
8192	6234	5827	5797	5801	5795	6114	5793	5812	
4096	9100	8835	8882	8896	8874	8890	8910	8906	
2048	14922	14391	14259	14248	14264	14264	14269	14273	
1024	23853	22690	22329	22362	22338	22277	22240	22301	
512	37353	33990	33292	33332	33306	33296	33224	33271	

rndrd	1	2	4	8	16	32	64	128
16384	4585	8248	14219	22533	32020	38636	39033	39054	
8192	6032	11186	20294	34443	53112	71228	78197	78284	
4096	8247	15539	29046	52090	86744	125835	154031	157143	
2048	11950	22652	42719	79562	140133	218092	286111	314870	
1024	16526	31294	59761	112494	207848	348226	483972	574403	
512	20635	39755	73010	130992	270648	484406	686190	726615	

seqwr	1	2	4	8	16	32	64	128
16384	39956	39695	39971	39913	37042	37538	36591	32179	
8192	67934	66073	30963	29038	29852	25210	23983	28272	
4096	89250	81417	28671	18685	12917	14870	22643	22237	
2048	140272	120588	140665	140012	137516	139183	131330	129684	
1024	217473	147899	210350	218526	219867	220120	219758	215166	
512	328260	181197	211131	263533	294009	298203	301698	298013	

rndwr	1	2	4	8	16	32	64	128
16384	38447	38153	38145	38140	38156	38199	38208	38236	
8192	78001	76965	76908	76945	77023	77174	77166	77106	
4096	160721	156000	157196	157084	157078	157123	156978	157149	
2048	325395	317148	317858	318442	318750	318981	319798	320393	
1024	434084	649814	650176	651820	653928	654223	655650	655818	
512	501067	876555	1290292	1217671	1244399	1267729	1285469	1298522	

Sequential reads are pretty horrible.  Sequential writes are hitting a
hot lock again.

So, if anyone would like to improve one of these filesystems and needs
more data, feel free to ping me.

Jörn

-- 
Victory in war is not repetitious.
-- Sun Tzu

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-17 16:05 Filesystem benchmarks on reasonably fast hardware Jörn Engel
@ 2011-07-17 23:32 ` Dave Chinner
       [not found]   ` <20110718075339.GB1437@logfs.org>
  2011-07-18 12:07 ` Ted Ts'o
  2011-07-19 13:19 ` Dave Chinner
  2 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2011-07-17 23:32 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-fsdevel

On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> Hello everyone!
> 
> Recently I have had the pleasure of working with some nice hardware
> and the displeasure of seeing it fail commercially.  However, when
> trying to optimize performance I noticed that in some cases the
> bottlenecks were not in the hardware or my driver, but rather in the
> filesystem on top of it.  So maybe all this may still be useful in
> improving said filesystem.
> 
> Hardware is basically a fast SSD.  Performance tops out at about
> 650MB/s and is fairly insensitive to random access behaviour.  Latency
> is about 50us for 512B reads and near 0 for writes, through the usual
> cheating.
> 
> Numbers below were created with sysbench, using directIO.  Each block
> is a matrix with results for blocksizes from 512B to 16384B and thread
> count from 1 to 128.  Four blocks for reads and writes, both
> sequential and random.

What's the command line/script used to generate the result matrix?
And what kernel are you running on?

> xfs:
> ====
> seqrd	1	2	4	8	16	32	64	128
> 16384	4698	4424	4397	4402	4394	4398	4642	4679	
> 8192	6234	5827	5797	5801	5795	6114	5793	5812	
> 4096	9100	8835	8882	8896	8874	8890	8910	8906	
> 2048	14922	14391	14259	14248	14264	14264	14269	14273	
> 1024	23853	22690	22329	22362	22338	22277	22240	22301	
> 512	37353	33990	33292	33332	33306	33296	33224	33271	

Something is single threading completely there - something is very
wrong. Someone want to send me a nice fast pci-e SSD - my disks
don't spin that fast... :/

> rndrd	1	2	4	8	16	32	64	128
> 16384	4585	8248	14219	22533	32020	38636	39033	39054	
> 8192	6032	11186	20294	34443	53112	71228	78197	78284	
> 4096	8247	15539	29046	52090	86744	125835	154031	157143	
> 2048	11950	22652	42719	79562	140133	218092	286111	314870	
> 1024	16526	31294	59761	112494	207848	348226	483972	574403	
> 512	20635	39755	73010	130992	270648	484406	686190	726615	
> 
> seqwr	1	2	4	8	16	32	64	128
> 16384	39956	39695	39971	39913	37042	37538	36591	32179	
> 8192	67934	66073	30963	29038	29852	25210	23983	28272	
> 4096	89250	81417	28671	18685	12917	14870	22643	22237	
> 2048	140272	120588	140665	140012	137516	139183	131330	129684	
> 1024	217473	147899	210350	218526	219867	220120	219758	215166	
> 512	328260	181197	211131	263533	294009	298203	301698	298013	
> 
> rndwr	1	2	4	8	16	32	64	128
> 16384	38447	38153	38145	38140	38156	38199	38208	38236	
> 8192	78001	76965	76908	76945	77023	77174	77166	77106	
> 4096	160721	156000	157196	157084	157078	157123	156978	157149	
> 2048	325395	317148	317858	318442	318750	318981	319798	320393	
> 1024	434084	649814	650176	651820	653928	654223	655650	655818	
> 512	501067	876555	1290292	1217671	1244399	1267729	1285469	1298522	

I'm assuming that if the h/w can do 650MB/s then the numbers are in
iops? From 4 threads up all results equate to 650MB/s.
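(Sanity check against the quoted rndwr table: 38145 IOPS x 16384 bytes is
about 625 MB/s and 1290292 IOPS x 512 bytes is about 661 MB/s, both right
at that ceiling.)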

> Sequential reads are pretty horrible.  Sequential writes are hitting a
> hot lock again.

lockstat output?

> So, if anyone would like to improve one of these filesystems and needs
> more data, feel free to ping me.

Of course I'm interested. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Filesystem benchmarks on reasonably fast hardware
       [not found]   ` <20110718075339.GB1437@logfs.org>
@ 2011-07-18 10:57     ` Dave Chinner
  2011-07-18 11:40       ` Jörn Engel
  2011-07-18 14:34       ` Jörn Engel
       [not found]     ` <20110718103956.GE1437@logfs.org>
  1 sibling, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2011-07-18 10:57 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-fsdevel

On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> > > 
> > > Numbers below were created with sysbench, using directIO.  Each block
> > > is a matrix with results for blocksizes from 512B to 16384B and thread
> > > count from 1 to 128.  Four blocks for reads and writes, both
> > > sequential and random.
> > 
> > What's the command line/script used to generate the result matrix?
> > And what kernel are you running on?
> 
> Script is attached.  Kernel is git from July 13th (51414d41).

Ok, thanks.

> > > xfs:
> > > ====
> > > seqrd	1	2	4	8	16	32	64	128
> > > 16384	4698	4424	4397	4402	4394	4398	4642	4679	
> > > 8192	6234	5827	5797	5801	5795	6114	5793	5812	
> > > 4096	9100	8835	8882	8896	8874	8890	8910	8906	
> > > 2048	14922	14391	14259	14248	14264	14264	14269	14273	
> > > 1024	23853	22690	22329	22362	22338	22277	22240	22301	
> > > 512	37353	33990	33292	33332	33306	33296	33224	33271	
> > 
> > Something is single threading completely there - something is very
> > wrong. Someone want to send me a nice fast pci-e SSD - my disks
> > don't spin that fast... :/
> 
> I wish I could just go down the shop and pick one from the
> manufacturing line. :/

Heh. At this point any old pci-e ssd would be an improvement ;)

> > > rndwr	1	2	4	8	16	32	64	128
> > > 16384	38447	38153	38145	38140	38156	38199	38208	38236	
> > > 8192	78001	76965	76908	76945	77023	77174	77166	77106	
> > > 4096	160721	156000	157196	157084	157078	157123	156978	157149	
> > > 2048	325395	317148	317858	318442	318750	318981	319798	320393	
> > > 1024	434084	649814	650176	651820	653928	654223	655650	655818	
> > > 512	501067	876555	1290292	1217671	1244399	1267729	1285469	1298522	
> > 
> > I'm assuming that if the h/w can do 650MB/s then the numbers are in
> > iops? From 4 threads up all results equate to 650MB/s.
> 
> Correct.  Writes are spread automatically across all chips.  They are
> further cached, so until every chip is busy writing, their effective
> latency is pretty much 0.  Makes for a pretty flat graph, I agree.
> 
> > > Sequential reads are pretty horrible.  Sequential writes are hitting a
> > > hot lock again.
> > 
> > lockstat output?
> 
> Attached for the bottom right case each of seqrd and seqwr.  I hope
> the filenames are descriptive enough.

Looks like you attached the seqrd lockstat twice.

> Lockstat itself hurts
> performance.  Writes were at 32245 IO/s, down from 298013, and reads
> at 22458 IO/s, down from 33271.  In a way we are measuring oranges to
> figure out why our apples are so small.

Yeah, but at least it points out the lock in question - the iolock.

We grab it exclusively for a very short period of time on each
direct IO read to check the page cache state, then demote it to
shared. I can see that when IO times are very short, this will, in
fact, serialise multiple readers to a single file.
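
In code terms, the read path before the patch below does roughly this
(a condensed sketch paraphrased from the diff further down, not the
verbatim source):

	if (ioflags & IO_ISDIRECT) {
		/* exclusive: waits for every shared holder, i.e. all IO in flight */
		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
		if (inode->i_mapping->nrpages) {
			/* ... write back and invalidate cached pages ... */
		}
		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
	} else {
		xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
	}

so every direct IO read briefly takes the iolock exclusive, even when
nrpages is zero and there is nothing to invalidate.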

A single thread shows this locking pattern:

        sysbench-3087  [000] 2192558.643146: xfs_ilock:            dev 253:0 ino 0x83 flags IOLOCK_EXCL caller xfs_rw_ilock
        sysbench-3087  [000] 2192558.643147: xfs_ilock_demote:     dev 253:0 ino 0x83 flags IOLOCK_EXCL caller T.1428
        sysbench-3087  [000] 2192558.643150: xfs_ilock:            dev 253:0 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_map_shared
        sysbench-3087  [001] 2192558.643877: xfs_ilock:            dev 253:0 ino 0x83 flags IOLOCK_EXCL caller xfs_rw_ilock
        sysbench-3087  [001] 2192558.643879: xfs_ilock_demote:     dev 253:0 ino 0x83 flags IOLOCK_EXCL caller T.1428
        sysbench-3087  [007] 2192558.643881: xfs_ilock:            dev 253:0 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_map_shared

Two threads show this:

        sysbench-3096  [005] 2192697.678308: xfs_ilock:            dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
        sysbench-3096  [005] 2192697.678314: xfs_ilock_demote:     dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
        sysbench-3096  [005] 2192697.678335: xfs_ilock:            dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared
        sysbench-3097  [006] 2192697.678556: xfs_ilock:            dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
        sysbench-3097  [006] 2192697.678556: xfs_ilock_demote:     dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
        sysbench-3097  [006] 2192697.678577: xfs_ilock:            dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared
        sysbench-3096  [007] 2192697.678976: xfs_ilock:            dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
        sysbench-3096  [007] 2192697.678978: xfs_ilock_demote:     dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
        sysbench-3096  [007] 2192697.679000: xfs_ilock:            dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared

Which shows the exclusive lock on the concurrent IO serialising on
the IO in progress. Oops, that's not good.

Ok, the patch below takes the numbers on my test setup on a 16k IO
size:

seqrd	1	2	4	8	16
vanilla	3603	2798	 2563	not tested...
patches 3707	5746	10304	12875	11016

So those numbers look a lot healthier. The patch is below.

> -- 
> Fancy algorithms are slow when n is small, and n is usually small.
> Fancy algorithms have big constants. Until you know that n is
> frequently going to be big, don't get fancy.
> -- Rob Pike

Heh. XFS always assumes n will be big. Because where XFS is used, it
just is.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: don't serialise direct IO reads on page cache checks

From: Dave Chinner <dchinner@redhat.com>

There is no need to grab the i_mutex of the IO lock in exclusive
mode if we don't need to invalidate the page cache. Taking hese
locks on every direct IO effective serialisaes them as taking the IO
lock in exclusive mode has to wait for all shared holders to drop
the lock. That only happens when IO is complete, so effective it
prevents dispatch of concurrent direct IO reads to the same inode.

Fix this by taking the IO lock shared to check the page cache state,
and only then drop it and take the IO lock exclusively if there is
work to be done. Hence for the normal direct IO case, no exclusive
locking will occur.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_file.c |   17 ++++++++++++++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index 1e641e6..16a4bf0 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -321,7 +321,19 @@ xfs_file_aio_read(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	if (unlikely(ioflags & IO_ISDIRECT)) {
+	/*
+	 * Locking is a bit tricky here. If we take an exclusive lock
+	 * for direct IO, we effectively serialise all new concurrent
+	 * read IO to this file and block it behind IO that is currently in
+	 * progress because IO in progress holds the IO lock shared. We only
+	 * need to hold the lock exclusive to blow away the page cache, so
+	 * only take lock exclusively if the page cache needs invalidation.
+	 * This allows the normal direct IO case of no page cache pages to
+	 * proceeed concurrently without serialisation.
+	 */
+	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	if ((ioflags & IO_ISDIRECT) && inode->i_mapping->nrpages) {
+		xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
 
 		if (inode->i_mapping->nrpages) {
@@ -334,8 +346,7 @@ xfs_file_aio_read(
 			}
 		}
 		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
-	} else
-		xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	}
 
 	trace_xfs_file_read(ip, size, iocb->ki_pos, ioflags);
 

* Re: Filesystem benchmarks on reasonably fast hardware
       [not found]     ` <20110718103956.GE1437@logfs.org>
@ 2011-07-18 11:10       ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2011-07-18 11:10 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-fsdevel

On Mon, Jul 18, 2011 at 12:39:56PM +0200, Jörn Engel wrote:
> Write lockstat (I mistakenly sent the read one twice).

Yeah, that's the i_mutex that is the issue there. We are definitely
taking exclusive locks during the IO submission process there.

I suspect I might be able to write a patch that does all the checks
under a shared lock - similar to the patch for the read side - but
it is definitely more complex and I'll have to have a bit of a think
about it.

Thanks for the bug report!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-18 10:57     ` Dave Chinner
@ 2011-07-18 11:40       ` Jörn Engel
  2011-07-19  2:41         ` Dave Chinner
  2011-07-18 14:34       ` Jörn Engel
  1 sibling, 1 reply; 21+ messages in thread
From: Jörn Engel @ 2011-07-18 11:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

On Mon, 18 July 2011 20:57:49 +1000, Dave Chinner wrote:
> On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> > On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> 
> > > > xfs:
> > > > ====
> > > > seqrd	1	2	4	8	16	32	64	128
> > > > 16384	4698	4424	4397	4402	4394	4398	4642	4679	
> > > > 8192	6234	5827	5797	5801	5795	6114	5793	5812	
> > > > 4096	9100	8835	8882	8896	8874	8890	8910	8906	
> > > > 2048	14922	14391	14259	14248	14264	14264	14269	14273	
> > > > 1024	23853	22690	22329	22362	22338	22277	22240	22301	
> > > > 512	37353	33990	33292	33332	33306	33296	33224	33271	

Your patch definitely helps.  Bottom right number is 584741 now.
Still slower than ext4 or btrfs, but in the right ballpark.  Will
post the entire block once it has been generated.

Jörn

-- 
Data dominates. If you've chosen the right data structures and organized
things well, the algorithms will almost always be self-evident. Data
structures, not algorithms, are central to programming.
-- Rob Pike

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-17 16:05 Filesystem benchmarks on reasonably fast hardware Jörn Engel
  2011-07-17 23:32 ` Dave Chinner
@ 2011-07-18 12:07 ` Ted Ts'o
  2011-07-18 12:42   ` Jörn Engel
  2011-07-19 13:19 ` Dave Chinner
  2 siblings, 1 reply; 21+ messages in thread
From: Ted Ts'o @ 2011-07-18 12:07 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-fsdevel

Hey Jörn,

Can you send me your script and the lockstat for ext4?

(Please cc the linux-ext4@vger.kernel.org list if you don't mind.
Thanks!!)

Thanks,

						- Ted

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-18 12:07 ` Ted Ts'o
@ 2011-07-18 12:42   ` Jörn Engel
  2011-07-25 15:18     ` Ted Ts'o
  0 siblings, 1 reply; 21+ messages in thread
From: Jörn Engel @ 2011-07-18 12:42 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: linux-fsdevel, linux-ext4

[-- Attachment #1: Type: text/plain, Size: 666 bytes --]

On Mon, 18 July 2011 08:07:51 -0400, Ted Ts'o wrote:
> 
> Can you send me your script and the lockstat for ext4?

Attached.  The first script generates a bunch of files, the second
condenses them into the tabular form.  Will need some massaging to
work on anything other than my particular setup, sorry.

> (Please cc the linux-ext4@vger.kernel.org list if you don't mind.
> Thanks!!)

Sure.  Lockstat will come later today.  The machine is currently busy
regenerating xfs seqrd numbers.

Jörn

-- 
I've never met a human being who would want to read 17,000 pages of
documentation, and if there was, I'd kill him to get him out of the
gene pool.
-- Joseph Costello

[-- Attachment #2: sysbench.sh --]
[-- Type: application/x-sh, Size: 1612 bytes --]

[-- Attachment #3: sysbench_result.sh --]
[-- Type: application/x-sh, Size: 819 bytes --]


* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-18 10:57     ` Dave Chinner
  2011-07-18 11:40       ` Jörn Engel
@ 2011-07-18 14:34       ` Jörn Engel
  1 sibling, 0 replies; 21+ messages in thread
From: Jörn Engel @ 2011-07-18 14:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

On Mon, 18 July 2011 20:57:49 +1000, Dave Chinner wrote:
> On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> > On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> 
> > > > xfs:
> > > > ====
> > > > seqrd	1	2	4	8	16	32	64	128
> > > > 16384	4698	4424	4397	4402	4394	4398	4642	4679	
> > > > 8192	6234	5827	5797	5801	5795	6114	5793	5812	
> > > > 4096	9100	8835	8882	8896	8874	8890	8910	8906	
> > > > 2048	14922	14391	14259	14248	14264	14264	14269	14273	
> > > > 1024	23853	22690	22329	22362	22338	22277	22240	22301	
> > > > 512	37353	33990	33292	33332	33306	33296	33224	33271	

seqrd	1	2	4	8	16	32	64	128
16384	4542	8311	15738	28955	38273	36644	38530	38527	
8192	6000	10413	19208	33878	65927	76906	77083	77102	
4096	8931	14971	24794	44223	83512	144867	147581	150702	
2048	14375	23489	34364	56887	103053	192662	307167	309222	
1024	21647	36022	49649	77163	132886	243296	421389	497581	
512	31832	61257	79545	108782	176341	303836	517814	584741	

Quite a nice improvement for such a small patch.  As they say, "every
small factor of 17 helps". ;)

What bothers me a bit is that the single-threaded numbers took such a
noticeable hit...

> Ok, the patch below takes the numbers on my test setup on a 16k IO
> size:
> 
> seqrd	1	2	4	8	16
> vanilla	3603	2798	 2563	not tested...
> patches 3707	5746	10304	12875	11016

...in particular when your numbers improve even for a single thread.
Wonder what's going on here.

Anyway, feel free to add a Tested-By: or something from me.  And maybe
fix the two typos below.

> xfs: don't serialise direct IO reads on page cache checks
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> There is no need to grab the i_mutex of the IO lock in exclusive
> mode if we don't need to invalidate the page cache. Taking hese
                                                             ^
> locks on every direct IO effective serialisaes them as taking the IO
                                             ^
> lock in exclusive mode has to wait for all shared holders to drop
> the lock. That only happens when IO is complete, so effective it
> prevents dispatch of concurrent direct IO reads to the same inode.
> 
> Fix this by taking the IO lock shared to check the page cache state,
> and only then drop it and take the IO lock exclusively if there is
> work to be done. Hence for the normal direct IO case, no exclusive
> locking will occur.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_file.c |   17 ++++++++++++++---
>  1 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
> index 1e641e6..16a4bf0 100644
> --- a/fs/xfs/linux-2.6/xfs_file.c
> +++ b/fs/xfs/linux-2.6/xfs_file.c
> @@ -321,7 +321,19 @@ xfs_file_aio_read(
>  	if (XFS_FORCED_SHUTDOWN(mp))
>  		return -EIO;
>  
> -	if (unlikely(ioflags & IO_ISDIRECT)) {
> +	/*
> +	 * Locking is a bit tricky here. If we take an exclusive lock
> +	 * for direct IO, we effectively serialise all new concurrent
> +	 * read IO to this file and block it behind IO that is currently in
> +	 * progress because IO in progress holds the IO lock shared. We only
> +	 * need to hold the lock exclusive to blow away the page cache, so
> +	 * only take lock exclusively if the page cache needs invalidation.
> +	 * This allows the normal direct IO case of no page cache pages to
> +	 * proceeed concurrently without serialisation.
> +	 */
> +	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
> +	if ((ioflags & IO_ISDIRECT) && inode->i_mapping->nrpages) {
> +		xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
>  		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
>  
>  		if (inode->i_mapping->nrpages) {
> @@ -334,8 +346,7 @@ xfs_file_aio_read(
>  			}
>  		}
>  		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
> -	} else
> -		xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
> +	}
>  
>  	trace_xfs_file_read(ip, size, iocb->ki_pos, ioflags);
>  

Jörn

-- 
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-18 11:40       ` Jörn Engel
@ 2011-07-19  2:41         ` Dave Chinner
  2011-07-19  7:36           ` Jörn Engel
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2011-07-19  2:41 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-fsdevel

On Mon, Jul 18, 2011 at 01:40:36PM +0200, Jörn Engel wrote:
> On Mon, 18 July 2011 20:57:49 +1000, Dave Chinner wrote:
> > On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> > > On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > > > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> > 
> > > > > xfs:
> > > > > ====
> > > > > seqrd	1	2	4	8	16	32	64	128
> > > > > 16384	4698	4424	4397	4402	4394	4398	4642	4679	
> > > > > 8192	6234	5827	5797	5801	5795	6114	5793	5812	
> > > > > 4096	9100	8835	8882	8896	8874	8890	8910	8906	
> > > > > 2048	14922	14391	14259	14248	14264	14264	14269	14273	
> > > > > 1024	23853	22690	22329	22362	22338	22277	22240	22301	
> > > > > 512	37353	33990	33292	33332	33306	33296	33224	33271	
> 
> Your patch definitely helps.  Bottom right number is 584741 now.
> Still slower than ext4 or btrfs, but in the right ballpark.  Will
> post the entire block once it has been generated.

The btrfs numbers come from doing different IO. Have a look at all
the sub-filesystem-block-size numbers for btrfs. No matter the
thread count, the number is the same - hardware limits. btrfs is not
doing an IO per read syscall there - I'd say it's falling back to
buffered IO, unlike ext4 and xfs....

.....

> seqrd	1	2	4	8	16	32	64	128
> 16384	4542	8311	15738	28955	38273	36644	38530	38527	
> 8192	6000	10413	19208	33878	65927	76906	77083	77102	
> 4096	8931	14971	24794	44223	83512	144867	147581	150702	
> 2048	14375	23489	34364	56887	103053	192662	307167	309222	
> 1024	21647	36022	49649	77163	132886	243296	421389	497581	
> 512	31832	61257	79545	108782	176341	303836	517814	584741	
> 
> Quite a nice improvement for such a small patch.  As they say, "every
> small factor of 17 helps". ;)

And in general the numbers are within a couple of percent of the
ext4 numbers, which is probably a reflection of the slightly higher
CPU cost of the XFS read path compared to ext4.

> What bothers me a bit is that the single-threaded numbers took such a
> noticeable hit...

Is it reproducible? I did notice quite a bit of run-to-run variation
in the numbers I ran. For single threaded numbers, they appear to be
in the order of +/-100 ops @ 16k block size.

> 
> > Ok, the patch below takes the numbers on my test setup on a 16k IO
> > size:
> > 
> > seqrd	1	2	4	8	16
> > vanilla	3603	2798	 2563	not tested...
> > patches 3707	5746	10304	12875	11016
> 
> ...in particular when your numbers improve even for a single thread.
> Wonder what's going on here.

And these were just quoted from a single test run.

> Anyway, feel free to add a Tested-By: or something from me.  And maybe
> fix the two typos below.

Will do.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-19  2:41         ` Dave Chinner
@ 2011-07-19  7:36           ` Jörn Engel
  2011-07-19  9:23             ` srimugunthan dhandapani
  2011-07-19 10:15             ` Dave Chinner
  0 siblings, 2 replies; 21+ messages in thread
From: Jörn Engel @ 2011-07-19  7:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

On Tue, 19 July 2011 12:41:38 +1000, Dave Chinner wrote:
> On Mon, Jul 18, 2011 at 01:40:36PM +0200, Jörn Engel wrote:
> > On Mon, 18 July 2011 20:57:49 +1000, Dave Chinner wrote:
> > > On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> > > > On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > > > > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> > > 
> > > > > > xfs:
> > > > > > ====
> > > > > > seqrd	1	2	4	8	16	32	64	128
> > > > > > 16384	4698	4424	4397	4402	4394	4398	4642	4679	
> > > > > > 8192	6234	5827	5797	5801	5795	6114	5793	5812	
> > > > > > 4096	9100	8835	8882	8896	8874	8890	8910	8906	
> > > > > > 2048	14922	14391	14259	14248	14264	14264	14269	14273	
> > > > > > 1024	23853	22690	22329	22362	22338	22277	22240	22301	
> > > > > > 512	37353	33990	33292	33332	33306	33296	33224	33271	
> 
> > seqrd	1	2	4	8	16	32	64	128
> > 16384	4542	8311	15738	28955	38273	36644	38530	38527	
> > 8192	6000	10413	19208	33878	65927	76906	77083	77102	
> > 4096	8931	14971	24794	44223	83512	144867	147581	150702	
> > 2048	14375	23489	34364	56887	103053	192662	307167	309222	
> > 1024	21647	36022	49649	77163	132886	243296	421389	497581	
> > 512	31832	61257	79545	108782	176341	303836	517814	584741	
> 
> > What bothers me a bit is that the single-threaded numbers took such a
> > noticeable hit...
> 
> Is it reproducible? I did notice quite a bit of run-to-run variation
> in the numbers I ran. For single threaded numbers, they appear to be
> in the order of +/-100 ops @ 16k block size.

IME the numbers are stable within about 10%.  And given that out of
six measurements every single one is a regression, I would feel
confident to bet a beverage without further measurements.  Regression
is 3.4%, 3.9%, 1.9%, 3.8%, 10% and 17% respectively, so the effect
appears to be more visible with smaller block sizes as well.

Jörn

-- 
Schrödinger's cat is <BLINK>not</BLINK> dead.
-- Illiad

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-19  7:36           ` Jörn Engel
@ 2011-07-19  9:23             ` srimugunthan dhandapani
  2011-07-21 19:05               ` Jörn Engel
  2011-07-19 10:15             ` Dave Chinner
  1 sibling, 1 reply; 21+ messages in thread
From: srimugunthan dhandapani @ 2011-07-19  9:23 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-fsdevel

On Tue, Jul 19, 2011 at 1:06 PM, Jörn Engel <joern@logfs.org> wrote:
> On Tue, 19 July 2011 12:41:38 +1000, Dave Chinner wrote:
>> On Mon, Jul 18, 2011 at 01:40:36PM +0200, Jörn Engel wrote:
>> > On Mon, 18 July 2011 20:57:49 +1000, Dave Chinner wrote:
>> > > On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
>> > > > On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
>> > > > > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
>> > >
>> > > > > > xfs:
>> > > > > > ====
>> > > > > > seqrd       1       2       4       8       16      32      64      128
>> > > > > > 16384       4698    4424    4397    4402    4394    4398    4642    4679
>> > > > > > 8192        6234    5827    5797    5801    5795    6114    5793    5812
>> > > > > > 4096        9100    8835    8882    8896    8874    8890    8910    8906
>> > > > > > 2048        14922   14391   14259   14248   14264   14264   14269   14273
>> > > > > > 1024        23853   22690   22329   22362   22338   22277   22240   22301
>> > > > > > 512 37353   33990   33292   33332   33306   33296   33224   33271
>>
>> > seqrd       1       2       4       8       16      32      64      128
>> > 16384       4542    8311    15738   28955   38273   36644   38530   38527
>> > 8192        6000    10413   19208   33878   65927   76906   77083   77102
>> > 4096        8931    14971   24794   44223   83512   144867  147581  150702
>> > 2048        14375   23489   34364   56887   103053  192662  307167  309222
>> > 1024        21647   36022   49649   77163   132886  243296  421389  497581
>> > 512 31832   61257   79545   108782  176341  303836  517814  584741
>>
>> > What bothers me a bit is that the single-threaded numbers took such a
>> > noticeable hit...
>>
>> Is it reproducible? I did notice quite a bit of run-to-run variation
>> in the numbers I ran. For single threaded numbers, they appear to be
>> in the order of +/-100 ops @ 16k block size.
>
> IME the numbers are stable within about 10%.  And given that out of
> six measurements every single one is a regression, I would feel
> confident to bet a beverage without further measurements.  Regression
> is 3.4%, 3.9%, 1.9%, 3.8%, 10% and 17% respectively, so the effect
> appears to be more visible with smaller block sizes as well.
>
> Jörn

Hi Joern,
Is the hardware the "Drais card" that you described in the following link?
www.linux-kongress.org/2010/slides/logfs-engel.pdf
Since the driver exposes an mtd device, do you mount the ext4/btrfs
filesystems over any FTL?

Is it possible to have logfs over the PCIe-SSD card?

Pardon me for asking the following in this thread.
I have been trying to mount logfs and I hit a segfault during unmount.
I have tested it on 2.6.34 and 2.6.39.1.  I have asked about the
problem here:
http://comments.gmane.org/gmane.linux.file-systems/55008

Two other people have also faced unmount problems with logfs:

1. http://comments.gmane.org/gmane.linux.file-systems/46630
2. http://eeek.borgchat.net/lists/linux-embedded/msg02970.html

My apologies again for asking it here.  Since the logfs@logfs.org
mailing list (and the wiki) doesn't work any more, I am asking the
question here.  I am thankful for your reply.
Thanks,
mugunthan

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-19  7:36           ` Jörn Engel
  2011-07-19  9:23             ` srimugunthan dhandapani
@ 2011-07-19 10:15             ` Dave Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2011-07-19 10:15 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-fsdevel

On Tue, Jul 19, 2011 at 09:36:33AM +0200, Jörn Engel wrote:
> On Tue, 19 July 2011 12:41:38 +1000, Dave Chinner wrote:
> > On Mon, Jul 18, 2011 at 01:40:36PM +0200, Jörn Engel wrote:
> > > On Mon, 18 July 2011 20:57:49 +1000, Dave Chinner wrote:
> > > > On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> > > > > On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > > > > > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> > > > 
> > > > > > > xfs:
> > > > > > > ====
> > > > > > > seqrd	1	2	4	8	16	32	64	128
> > > > > > > 16384	4698	4424	4397	4402	4394	4398	4642	4679	
> > > > > > > 8192	6234	5827	5797	5801	5795	6114	5793	5812	
> > > > > > > 4096	9100	8835	8882	8896	8874	8890	8910	8906	
> > > > > > > 2048	14922	14391	14259	14248	14264	14264	14269	14273	
> > > > > > > 1024	23853	22690	22329	22362	22338	22277	22240	22301	
> > > > > > > 512	37353	33990	33292	33332	33306	33296	33224	33271	
> > 
> > > seqrd	1	2	4	8	16	32	64	128
> > > 16384	4542	8311	15738	28955	38273	36644	38530	38527	
> > > 8192	6000	10413	19208	33878	65927	76906	77083	77102	
> > > 4096	8931	14971	24794	44223	83512	144867	147581	150702	
> > > 2048	14375	23489	34364	56887	103053	192662	307167	309222	
> > > 1024	21647	36022	49649	77163	132886	243296	421389	497581	
> > > 512	31832	61257	79545	108782	176341	303836	517814	584741	
> > 
> > > What bothers me a bit is that the single-threaded numbers took such a
> > > noticeable hit...
> > 
> > Is it reproducable? I did notice quite a bit of run-to-run variation
> > in the numbers I ran. For single threaded numbers, they appear to be
> > in the order of +/-100 ops @ 16k block size.
> 
> IME the numbers are stable within about 10%.  And given that out of
> six measurements every single one is a regression, I would feel
> confident to bet a beverage without further measurements.  Regression
> is 3.4%, 3.9%, 1.9%, 3.8%, 10% and 17% respectively, so the effect
> appears to be more visible with smaller block sizes as well.

Only thing I can think of then is that taking the lock shared is
more expensive than taking it exclusive. Otherwise there is little
change to the code path....

/me shrugs and cares not all that much right now

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-17 16:05 Filesystem benchmarks on reasonably fast hardware Jörn Engel
  2011-07-17 23:32 ` Dave Chinner
  2011-07-18 12:07 ` Ted Ts'o
@ 2011-07-19 13:19 ` Dave Chinner
  2011-07-21 10:42   ` Jörn Engel
  2 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2011-07-19 13:19 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-fsdevel

On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> xfs:
> ====
.....
> seqwr	1	2	4	8	16	32	64	128
> 16384	39956	39695	39971	39913	37042	37538	36591	32179	
> 8192	67934	66073	30963	29038	29852	25210	23983	28272	
> 4096	89250	81417	28671	18685	12917	14870	22643	22237	
> 2048	140272	120588	140665	140012	137516	139183	131330	129684	
> 1024	217473	147899	210350	218526	219867	220120	219758	215166	
> 512	328260	181197	211131	263533	294009	298203	301698	298013	

OK, I can explain the pattern here where throughput drops off at 2-4
threads. It's not as simple as the seqrd case, but it's related to
the fact that this workload is an append write workload. See the
patch description below for why that matters.

As it is, the numbers I get for 16k seqwr on my hardware are as
follows:

seqwr	1	2	4	8	16
vanilla	3072	2734	2506	not tested...
patched 2984	4156	4922	5175	5120

Looks like my hardware is topping out at ~5-6kiops no matter the
block size here. Which, no matter how you look at it, is a
significant improvement. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


xfs: don't serialise adjacent concurrent direct IO appending writes

For append write workloads, extending the file requires a certain
amount of exclusive locking to be done up front to ensure sanity in
things like ensuring that we've zeroed any allocated regions
between the old EOF and the start of the new IO.

For single threads, this typically isn't a problem, and for large
IOs we don't serialise enough for it to be a problem for two
threads on really fast block devices. However for smaller IO and
larger thread counts we have a problem.

Take 4 concurrent sequential, single block sized and aligned IOs.
After the first IO is submitted but before it completes, we end up
with this state:

        IO 1    IO 2    IO 3    IO 4
      +-------+-------+-------+-------+
      ^       ^
      |       |
      |       |
      |       |
      |       \- ip->i_new_size
      \- ip->i_size

And the IO is done without exclusive locking because offset <=
ip->i_size. When we submit IO 2, we see offset > ip->i_size, and
grab the IO lock exclusive, because there is a chance we need to do
EOF zeroing. However, there is already an IO in progress that avoids
the need for EOF zeroing because offset <= ip->i_new_size. Hence we
could avoid holding the IO lock exclusive for this. Hence after
submission of the second IO, we'd end up in this state:

        IO 1    IO 2    IO 3    IO 4
      +-------+-------+-------+-------+
      ^               ^
      |               |
      |               |
      |               |
      |               \- ip->i_new_size
      \- ip->i_size

There is no need to grab the i_mutex of the IO lock in exclusive
mode if we don't need to invalidate the page cache. Taking these
locks on every direct IO effectively serialises them as taking the IO
lock in exclusive mode has to wait for all shared holders to drop
the lock. That only happens when IO is complete, so effectively it
prevents dispatch of concurrent direct IO writes to the same inode.

And so you can see that for the third concurrent IO, we'd avoid
exclusive locking for the same reason we avoided the exclusive lock
for the second IO.

Fixing this is a bit more complex than that, because we need to hold
a write-submission local value of ip->i_new_size so that clearing
the value is only done if no other thread has updated it before our
IO completes.....

Signed-off-by: Dave Chinner <dchinner@redhat.com>

---
 fs/xfs/linux-2.6/xfs_aops.c |    7 ++++
 fs/xfs/linux-2.6/xfs_file.c |   69 ++++++++++++++++++++++++++++++++++---------
 2 files changed, 62 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 63e971e..dda9a9e 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -176,6 +176,13 @@ xfs_setfilesize(
 	if (unlikely(ioend->io_error))
 		return 0;
 
+	/*
+	 * If the IO is clearly not beyond the on-disk inode size,
+	 * return before we take locks.
+	 */
+	if (ioend->io_offset + ioend->io_size <= ip->i_d.di_size)
+		return 0;
+
 	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
 		return EAGAIN;
 
diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index 16a4bf0..5b6703a 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -422,11 +422,13 @@ xfs_aio_write_isize_update(
  */
 STATIC void
 xfs_aio_write_newsize_update(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	xfs_fsize_t		new_size)
 {
-	if (ip->i_new_size) {
+	if (new_size == ip->i_new_size) {
 		xfs_rw_ilock(ip, XFS_ILOCK_EXCL);
-		ip->i_new_size = 0;
+		if (new_size == ip->i_new_size)
+			ip->i_new_size = 0;
 		if (ip->i_d.di_size > ip->i_size)
 			ip->i_d.di_size = ip->i_size;
 		xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
@@ -478,7 +480,7 @@ xfs_file_splice_write(
 						count, flags);
 
 	xfs_aio_write_isize_update(inode, ppos, ret);
-	xfs_aio_write_newsize_update(ip);
+	xfs_aio_write_newsize_update(ip, new_size);
 	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
 	return ret;
 }
@@ -675,6 +677,7 @@ xfs_file_aio_write_checks(
 	struct file		*file,
 	loff_t			*pos,
 	size_t			*count,
+	xfs_fsize_t		*new_sizep,
 	int			*iolock)
 {
 	struct inode		*inode = file->f_mapping->host;
@@ -682,6 +685,8 @@ xfs_file_aio_write_checks(
 	xfs_fsize_t		new_size;
 	int			error = 0;
 
+restart:
+	*new_sizep = 0;
 	error = generic_write_checks(file, pos, count, S_ISBLK(inode->i_mode));
 	if (error) {
 		xfs_rw_iunlock(ip, XFS_ILOCK_EXCL | *iolock);
@@ -689,9 +694,18 @@ xfs_file_aio_write_checks(
 		return error;
 	}
 
+	/*
+	 * if we are writing beyond the current EOF, only update the
+	 * ip->i_new_size if it is larger than any other concurrent write beyond
+	 * EOF. Regardless of whether we update ip->i_new_size, return the
+	 * updated new_size to the caller.
+	 */
 	new_size = *pos + *count;
-	if (new_size > ip->i_size)
-		ip->i_new_size = new_size;
+	if (new_size > ip->i_size) {
+		if (new_size > ip->i_new_size)
+			ip->i_new_size = new_size;
+		*new_sizep = new_size;
+	}
 
 	if (likely(!(file->f_mode & FMODE_NOCMTIME)))
 		file_update_time(file);
@@ -699,10 +713,22 @@ xfs_file_aio_write_checks(
 	/*
 	 * If the offset is beyond the size of the file, we need to zero any
 	 * blocks that fall between the existing EOF and the start of this
-	 * write.
+	 * write. Don't issue zeroing if this IO is adjacent to an IO already in
+	 * flight. If we are currently holding the iolock shared, we need to
+	 * update it to exclusive which involves dropping all locks and
+	 * relocking to maintain correct locking order. If we do this, restart
+	 * the function to ensure all checks and values are still valid.
 	 */
-	if (*pos > ip->i_size)
+	if ((ip->i_new_size && *pos > ip->i_new_size) ||
+	    (!ip->i_new_size && *pos > ip->i_size)) {
+		if (*iolock == XFS_IOLOCK_SHARED) {
+			xfs_rw_iunlock(ip, XFS_ILOCK_EXCL | *iolock);
+			*iolock = XFS_IOLOCK_EXCL;
+			xfs_rw_ilock(ip, XFS_ILOCK_EXCL | *iolock);
+			goto restart;
+		}
 		error = -xfs_zero_eof(ip, *pos, ip->i_size);
+	}
 
 	xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
 	if (error)
@@ -749,6 +775,7 @@ xfs_file_dio_aio_write(
 	unsigned long		nr_segs,
 	loff_t			pos,
 	size_t			ocount,
+	xfs_fsize_t		*new_size,
 	int			*iolock)
 {
 	struct file		*file = iocb->ki_filp;
@@ -769,13 +796,25 @@ xfs_file_dio_aio_write(
 	if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
 		unaligned_io = 1;
 
-	if (unaligned_io || mapping->nrpages || pos > ip->i_size)
+	/*
+	 * Tricky locking alert: if we are doing multiple concurrent sequential
+	 * writes (e.g. via aio), we don't need to do EOF zeroing if the current
+	 * IO is adjacent to an in-flight IO. That means for such IO we can
+	 * avoid taking the IOLOCK exclusively. Hence we avoid checking for
+	 * writes beyond EOF at this point when deciding what lock to take.
+	 * We will take the IOLOCK exclusive later if necessary.
+	 *
+	 * This, however, means that we need a local copy of the ip->i_new_size
+	 * value from this IO if we change it so that we can determine if we can
+	 * clear the value from the inode when this IO completes.
+	 */
+	if (unaligned_io || mapping->nrpages)
 		*iolock = XFS_IOLOCK_EXCL;
 	else
 		*iolock = XFS_IOLOCK_SHARED;
 	xfs_rw_ilock(ip, XFS_ILOCK_EXCL | *iolock);
 
-	ret = xfs_file_aio_write_checks(file, &pos, &count, iolock);
+	ret = xfs_file_aio_write_checks(file, &pos, &count, new_size, iolock);
 	if (ret)
 		return ret;
 
@@ -814,6 +853,7 @@ xfs_file_buffered_aio_write(
 	unsigned long		nr_segs,
 	loff_t			pos,
 	size_t			ocount,
+	xfs_fsize_t		*new_size,
 	int			*iolock)
 {
 	struct file		*file = iocb->ki_filp;
@@ -827,7 +867,7 @@ xfs_file_buffered_aio_write(
 	*iolock = XFS_IOLOCK_EXCL;
 	xfs_rw_ilock(ip, XFS_ILOCK_EXCL | *iolock);
 
-	ret = xfs_file_aio_write_checks(file, &pos, &count, iolock);
+	ret = xfs_file_aio_write_checks(file, &pos, &count, new_size, iolock);
 	if (ret)
 		return ret;
 
@@ -867,6 +907,7 @@ xfs_file_aio_write(
 	ssize_t			ret;
 	int			iolock;
 	size_t			ocount = 0;
+	xfs_fsize_t		new_size = 0;
 
 	XFS_STATS_INC(xs_write_calls);
 
@@ -886,10 +927,10 @@ xfs_file_aio_write(
 
 	if (unlikely(file->f_flags & O_DIRECT))
 		ret = xfs_file_dio_aio_write(iocb, iovp, nr_segs, pos,
-						ocount, &iolock);
+						ocount, &new_size, &iolock);
 	else
 		ret = xfs_file_buffered_aio_write(iocb, iovp, nr_segs, pos,
-						ocount, &iolock);
+						ocount, &new_size, &iolock);
 
 	xfs_aio_write_isize_update(inode, &iocb->ki_pos, ret);
 
@@ -914,7 +955,7 @@ xfs_file_aio_write(
 	}
 
 out_unlock:
-	xfs_aio_write_newsize_update(ip);
+	xfs_aio_write_newsize_update(ip, new_size);
 	xfs_rw_iunlock(ip, iolock);
 	return ret;
 }

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-19 13:19 ` Dave Chinner
@ 2011-07-21 10:42   ` Jörn Engel
  2011-07-22 18:51     ` Jörn Engel
  0 siblings, 1 reply; 21+ messages in thread
From: Jörn Engel @ 2011-07-21 10:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

On Tue, 19 July 2011 23:19:58 +1000, Dave Chinner wrote:
> On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> > xfs:
> > ====
> .....
> > seqwr	1	2	4	8	16	32	64	128
> > 16384	39956	39695	39971	39913	37042	37538	36591	32179	
> > 8192	67934	66073	30963	29038	29852	25210	23983	28272	
> > 4096	89250	81417	28671	18685	12917	14870	22643	22237	
> > 2048	140272	120588	140665	140012	137516	139183	131330	129684	
> > 1024	217473	147899	210350	218526	219867	220120	219758	215166	
> > 512	328260	181197	211131	263533	294009	298203	301698	298013	
> 
> OK, I can explain the pattern here where throughput drops off at 2-4
> threads. It's not as simple as the seqrd case, but it's related to
> the fact that this workload is an append write workload. See the
> patch description below for why that matters.
> 
> As it is, the numbers I get for 16k seqwr on my hardware are as
> follows:
> 
> seqwr	1	2	4	8	16
> vanilla	3072	2734	2506	not tested...
> patched 2984	4156	4922	5175	5120
> 
> Looks like my hardware is topping out at ~5-6kiops no matter the
> block size here. Which, no matter how you look at it, is a
> significant improvement. ;)

My numbers include some regressions, although the improvements clearly
dominate.  Below is a diff (or div) between the new kernel with both
your patches applied and vanilla.  >1 means improvement, <1 means
regression.

seqrd	1	2	4	8	16	32	64	128
16384	1.037	1.975	3.726	6.643	8.901	8.902	8.431	8.365	
8192	1.015	1.871	3.459	6.424	11.457	12.829	13.542	13.490	
4096	1.009	1.790	2.942	5.179	9.634	16.667	17.652	17.666	
2048	1.005	1.709	2.525	4.196	7.479	14.022	22.032	22.100	
1024	1.017	1.624	2.328	3.587	6.112	11.365	20.311	21.315	
512	1.012	1.829	2.374	3.365	5.352	9.459	16.809	18.771	

rndrd	1	2	4	8	16	32	64	128
16384	1.042	1.037	1.036	1.043	1.051	1.011	1.002	1.001	
8192	1.020	1.020	1.028	1.040	1.057	1.064	1.002	1.001	
4096	1.011	1.007	1.021	1.036	1.059	1.086	1.021	1.001	
2048	1.002	1.010	1.018	1.025	1.057	1.100	1.098	1.003	
1024	1.001	1.002	1.023	1.007	1.072	1.112	1.162	1.102	
512	0.998	1.010	1.004	1.035	1.088	1.121	1.156	1.127	

seqwr	1	2	4	8	16	32	64	128
16384	0.942	0.949	0.942	0.945	1.017	1.004	1.030	1.172	
8192	1.144	1.177	2.517	2.687	2.611	3.091	3.246	2.741	
4096	1.389	1.506	4.228	6.443	9.313	8.064	5.276	5.394	
2048	1.139	1.278	1.080	1.076	1.094	1.087	1.142	1.148	
1024	0.852	1.190	0.806	0.783	0.776	0.774	0.769	0.774	
512	0.709	1.273	1.055	0.847	0.758	0.744	0.738	0.746	

rndwr	1	2	4	8	16	32	64	128
16384	1.013	1.003	1.002	1.005	1.007	1.006	1.003	1.002	
8192	1.023	1.005	1.007	1.006	1.006	1.004	1.004	1.006	
4096	1.020	1.007	1.007	1.007	1.007	1.007	1.008	1.007	
2048	0.901	1.017	1.007	1.008	1.008	1.009	1.008	1.007	
1024	0.848	0.949	1.003	0.990	1.001	1.006	1.006	1.005	
512	0.821	0.833	0.948	0.956	0.935	0.929	0.921	0.914	


Raw results:

seqrd	1	2	4	8	16	32	64	128
16384	4873	8738	16382	29241	39111	39152	39137	39140	
8192	6326	10900	20054	37263	66391	78437	78449	78404	
4096	9181	15816	26130	46073	85492	148172	157276	157329	
2048	14995	24588	36009	59790	106685	200012	314373	315440	
1024	24248	36841	51972	80207	136529	253175	451709	475353	
512	37813	62164	79048	112175	178246	314959	558458	624534	

rndrd	1	2	4	8	16	32	64	128
16384	4778	8554	14724	23507	33666	39065	39109	39104	
8192	6152	11409	20862	35814	56123	75776	78370	78380	
4096	8335	15643	29662	53953	91867	136643	157314	157325	
2048	11973	22885	43474	81545	148087	239997	314198	315680	
1024	16547	31345	61123	113283	222737	387234	562457	632767	
512	20590	40134	73333	135621	294448	543117	793329	818861	

seqwr	1	2	4	8	16	32	64	128
16384	37629	37651	37667	37711	37658	37674	37687	37727	
8192	77691	77747	77948	78017	77940	77931	77847	77488	
4096	123997	122607	121219	120394	120301	119908	119457	119939	
2048	159816	154063	151987	150608	150449	151298	150016	148852	
1024	185215	175977	169562	171078	170649	170420	169076	166614	
512	232890	230669	222830	223140	222877	221812	222588	222369	

rndwr	1	2	4	8	16	32	64	128
16384	38944	38256	38227	38312	38438	38432	38331	38313	
8192	79773	77378	77453	77425	77473	77500	77458	77535	
4096	163925	157167	158258	158192	158244	158281	158229	158252	
2048	293295	322480	320206	321022	321375	321926	322298	322558	
1024	368010	616516	652359	645514	654715	658132	659513	659125	
512	411236	730015	1223437	1164632	1163705	1178235	1184450	1186594	

Jörn

-- 
Ninety percent of everything is crap.
-- Sturgeon's Law

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-19  9:23             ` srimugunthan dhandapani
@ 2011-07-21 19:05               ` Jörn Engel
  0 siblings, 0 replies; 21+ messages in thread
From: Jörn Engel @ 2011-07-21 19:05 UTC (permalink / raw)
  To: srimugunthan dhandapani; +Cc: linux-fsdevel

On Tue, 19 July 2011 14:53:08 +0530, srimugunthan dhandapani wrote:
> 
> Is the hardware the "Drais card" that you described in the following link
> www.linux-kongress.org/2010/slides/logfs-engel.pdf

Yes.

> Since the driver exposes an mtd device, do you mount the ext4,btrfs
> filesystem over any FTL?

That was last year.  In the mean time I've added an FTL to the driver,
so the card behaves like a regular ssd.  Well, mostly.
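
(For the curious: at its core such an FTL is just a logical-to-physical
remapping table with out-of-place writes.  The sketch below is explicitly
not the actual driver (no garbage collection, no wear leveling, no error
correction), just the remapping idea in userspace C:)

/*
 * Minimal page-mapped FTL sketch: logical pages are remapped to
 * physical flash pages, every write goes to a fresh page, the old
 * copy is only marked stale.  GC and ECC are left out entirely.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE	4096
#define NR_LPAGES	1024	/* logical pages exported upwards */
#define NR_PPAGES	1280	/* physical pages, incl. spare area */
#define UNMAPPED	UINT32_MAX

static uint32_t l2p[NR_LPAGES];		/* logical -> physical map */
static uint8_t flash[NR_PPAGES][PAGE_SIZE];
static uint8_t stale[NR_PPAGES];
static uint32_t next_free;

static void ftl_init(void)
{
	for (uint32_t i = 0; i < NR_LPAGES; i++)
		l2p[i] = UNMAPPED;
}

static int ftl_write(uint32_t lpage, const void *buf)
{
	if (lpage >= NR_LPAGES || next_free >= NR_PPAGES)
		return -1;		/* a real FTL would GC here */
	if (l2p[lpage] != UNMAPPED)
		stale[l2p[lpage]] = 1;	/* old copy becomes garbage */
	memcpy(flash[next_free], buf, PAGE_SIZE);
	l2p[lpage] = next_free++;
	return 0;
}

static int ftl_read(uint32_t lpage, void *buf)
{
	if (lpage >= NR_LPAGES || l2p[lpage] == UNMAPPED)
		return -1;
	memcpy(buf, flash[l2p[lpage]], PAGE_SIZE);
	return 0;
}

int main(void)
{
	uint8_t buf[PAGE_SIZE] = { 0 };
	unsigned stale_pages = 0;

	ftl_init();
	ftl_write(7, buf);
	ftl_write(7, buf);	/* rewrite: new page, old one goes stale */
	ftl_read(7, buf);
	for (uint32_t i = 0; i < NR_PPAGES; i++)
		stale_pages += stale[i];
	printf("lpage 7 maps to ppage %u, %u stale page(s) await GC\n",
	       l2p[7], stale_pages);
	return 0;
}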

> Is it possible to have logfs over the PCIe-SSD card?

YeaaaNo!  Not anymore.  Could be lack of error correction in the
current driver or could be bitrot.  Logfs over loopback seems to work
just fine, so if it is bitrot, it is limited to the mtd interface.

> Pardon me for asking the following in this thread.
> I have been trying to mount logfs and I face a seg fault during
> unmount. I have tested it in 2.6.34 and 2.6.39.1. I have asked about the
> problem here.
> http://comments.gmane.org/gmane.linux.file-systems/55008
> 
> Two other people have also faced the umount problem in logfs:
> 
> 1. http://comments.gmane.org/gmane.linux.file-systems/46630
> 2. http://eeek.borgchat.net/lists/linux-embedded/msg02970.html
> 
> My apologies again for asking it here. Since the logfs@logfs.org
> mailing list (and the wiki) doesn't work any more, I am asking the
> question here. I am thankful for your reply.

Yes, ever since that machine died I have basically been the
non-maintainer of logfs.  In a different century I would have been
hanged, drawn and quartered for it.  Give me some time to test the mtd
side and see what's up.

Jörn

-- 
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface.
-- Doug McIlroy

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-21 10:42   ` Jörn Engel
@ 2011-07-22 18:51     ` Jörn Engel
  0 siblings, 0 replies; 21+ messages in thread
From: Jörn Engel @ 2011-07-22 18:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

On Thu, 21 July 2011 12:42:46 +0200, Jörn Engel wrote:
> On Tue, 19 July 2011 23:19:58 +1000, Dave Chinner wrote:
> > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> 
> [ Crap ]

I had tested ext4 with two xfs patches.  Try these numbers instead.
Both patches have my endorsement.  Excellent work!


seqrd	1	2	4	8	16	32	64	128
16384	1.000	1.880	3.456	6.297	8.727	8.703	8.271	8.208	
8192	1.001	1.811	3.304	6.153	10.061	12.567	13.248	12.077	
4096	1.001	1.752	2.832	4.968	9.199	15.937	17.228	17.139	
2048	1.001	1.689	2.459	4.053	7.152	13.241	21.565	21.694	
1024	1.011	1.619	2.296	3.521	5.935	10.849	19.649	27.848	
512	1.008	1.825	2.371	3.310	5.230	9.146	16.591	27.234	

rndrd	1	2	4	8	16	32	64	128
16384	1.003	1.005	1.009	1.021	1.032	1.009	1.001	1.001	
8192	1.002	1.004	1.013	1.024	1.041	1.051	1.001	1.001	
4096	1.003	1.004	1.013	1.027	1.049	1.071	1.020	1.000	
2048	1.004	1.010	1.019	1.011	1.052	1.091	1.091	1.002	
1024	1.003	1.009	1.028	1.027	1.068	1.109	1.155	1.099	
512	1.002	1.014	1.016	1.044	1.083	1.125	1.196	1.236	

seqwr	1	2	4	8	16	32	64	128
16384	1.003	1.001	0.981	0.953	0.995	0.947	1.057	1.203	
8192	0.999	1.048	2.120	2.060	1.799	1.991	2.093	1.998	
4096	0.991	1.074	2.901	3.878	5.218	4.030	2.358	2.601	
2048	1.005	1.273	1.058	1.077	1.112	1.123	1.137	1.161	
1024	0.999	1.605	1.147	1.059	1.059	1.047	1.064	1.069	
512	0.947	1.978	1.618	1.317	1.181	1.156	1.149	1.134	

rndwr	1	2	4	8	16	32	64	128
16384	1.000	0.999	1.000	1.001	1.000	1.000	1.001	0.999	
8192	0.999	1.000	1.000	1.001	1.000	1.001	1.001	1.003	
4096	0.997	0.998	1.000	1.000	1.001	1.000	1.001	1.000	
2048	1.002	1.001	1.001	1.003	1.001	1.002	1.000	1.000	
1024	0.998	1.001	1.000	1.001	1.000	1.001	0.999	1.001	
512	1.044	0.999	1.003	1.001	1.001	1.001	1.002	0.998	


seqrd	1	2	4	8	16	32	64	128
16384	4700	8316	15197	27721	38348	38277	38394	38406	
8192	6241	10551	19156	35692	58304	76835	76743	70192	
4096	9110	15477	25155	44196	81632	141681	153499	152642	
2048	14942	24309	35063	57754	102009	188865	307705	309641	
1024	24104	36724	51278	78737	132577	241681	437003	621032	
512	37646	62022	78943	110334	174203	304532	551212	906087	

rndrd	1	2	4	8	16	32	64	128
16384	4598	8288	14352	22999	33051	38977	39072	39086	
8192	6042	11233	20566	35279	55300	74863	78278	78359	
4096	8268	15604	29428	53514	91016	134799	157045	157144	
2048	11997	22877	43550	80430	147372	237967	312170	315369	
1024	16578	31577	61419	115548	221986	386119	558797	631441	
512	20668	40293	74185	136774	293068	545050	820771	897897	

seqwr	1	2	4	8	16	32	64	128
16384	40074	39718	39198	38027	36846	35562	38659	38726	
8192	67896	69240	65628	59807	53713	50181	50208	56486	
4096	88439	87416	83167	72468	67401	59932	53383	57845	
2048	141003	153543	148813	150740	152966	156238	149370	150576	
1024	217311	237402	241186	231341	232902	230429	233877	230095	
512	310980	358427	341578	347183	347281	344722	346779	337970	

rndwr	1	2	4	8	16	32	64	128
16384	38436	38112	38154	38161	38174	38208	38250	38197	
8192	77890	76972	76938	76993	77031	77255	77274	77301	
4096	160246	155612	157142	157090	157213	157081	157193	157160	
2048	326008	317372	318089	319273	318994	319596	319773	320299	
1024	433107	650226	649868	652195	653764	654760	655299	656246	
512	523091	875267	1294281	1218935	1245993	1269267	1287429	1296046	

Jörn

-- 
Fools ignore complexity.  Pragmatists suffer it.
Some can avoid it.  Geniuses remove it.
-- Perlis's Programming Proverb #58, SIGPLAN Notices, Sept.  1982

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-18 12:42   ` Jörn Engel
@ 2011-07-25 15:18     ` Ted Ts'o
  2011-07-25 18:20       ` Jörn Engel
  0 siblings, 1 reply; 21+ messages in thread
From: Ted Ts'o @ 2011-07-25 15:18 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-fsdevel, linux-ext4

On Mon, Jul 18, 2011 at 02:42:29PM +0200, Jörn Engel wrote:
> On Mon, 18 July 2011 08:07:51 -0400, Ted Ts'o wrote:
> > 
> > Can you send me your script and the lockstat for ext4?
> 
> Attached.  The first script generates a bunch of files, the second
> condenses them into the tabular form.  Will need some massaging to
> work on anything other than my particular setup, sorry.
> 
> > (Please cc the linux-ext4@vger.kernel.org list if you don't mind.
> > Thanks!!)
> 
> Sure.  Lockstat will come later today.  The machine is currently busy
> regenerating xfs seqrd numbers.

Hi Jörn,

Did you have a chance to do an ext4 lockstat run?

Many thanks!!

						- Ted

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-25 15:18     ` Ted Ts'o
@ 2011-07-25 18:20       ` Jörn Engel
  2011-07-25 21:18         ` Ted Ts'o
  2011-07-26 14:57         ` Ted Ts'o
  0 siblings, 2 replies; 21+ messages in thread
From: Jörn Engel @ 2011-07-25 18:20 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: linux-fsdevel, linux-ext4

On Mon, 25 July 2011 11:18:25 -0400, Ted Ts'o wrote:
> 
> Did you have a chance to do an ext4 lockstat run?

Yes, I did.  But your mails keep bouncing, so you have to look at the
list to see it (or this mail).  Yes, I lack a proper reverse DNS
record, as the IP belongs to my provider, not me.  Most people don't
care, some bounce, some silently ignore my mail.  The joys of spam
filtering.

Jörn

-- 
The rabbit runs faster than the fox, because the rabbit is running for
his life while the fox is only running for his dinner.
-- Aesop

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-25 18:20       ` Jörn Engel
@ 2011-07-25 21:18         ` Ted Ts'o
  2011-07-26 14:57         ` Ted Ts'o
  1 sibling, 0 replies; 21+ messages in thread
From: Ted Ts'o @ 2011-07-25 21:18 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-fsdevel, linux-ext4

On Mon, Jul 25, 2011 at 08:20:37PM +0200, Jörn Engel wrote:
> On Mon, 25 July 2011 11:18:25 -0400, Ted Ts'o wrote:
> > 
> > Did you have a chance to do an ext4 lockstat run?
> 
> Yes, I did.  But your mails keep bouncing, so you have to look at the
> list to see it (or this mail).  Yes, I lack a proper reverse DNS
> record, as the IP belongs to my provider, not me.  Most people don't
> care, some bounce, some silently ignore my mail.  The joys of spam
> filtering.

I didn't see the ext4 lockstat on the list.  Can you resend it to
tytso@google.com or theodore.tso@gmail.com?  MIT is using an
outsourced SPAM provider (Brightmail anti-spam), and I can't do
anything about that, unfortunately.  From what I can tell, Brightmail
doesn't drop all e-mails from non-resolving IPs, but if your IP sits in
a "bad neighborhood" (i.e., your neighbors are all spammers, or are
Windows machines of which 80% are spambots), Brightmail is probably
going to flag your mail as spam.  :-(

Thanks!

					- Ted

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-25 18:20       ` Jörn Engel
  2011-07-25 21:18         ` Ted Ts'o
@ 2011-07-26 14:57         ` Ted Ts'o
  2011-07-27  3:39           ` Yongqiang Yang
  1 sibling, 1 reply; 21+ messages in thread
From: Ted Ts'o @ 2011-07-26 14:57 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-fsdevel, linux-ext4

On Mon, Jul 25, 2011 at 08:20:37PM +0200, Jörn Engel wrote:
> On Mon, 25 July 2011 11:18:25 -0400, Ted Ts'o wrote:
> > 
> > Did you have a chance to do an ext4 lockstat run?

Hi Jörn,

Thanks for forwarding it to me.  It's the same problem as in XFS, the
excessive coverage of the i_mutex lock.  In ext4's case, it's in the
generic generic_file_aio_write() machinery where we need to do the
lock busting.  (XFS apparently doesn't use the generic routines, so
the fix that Dave did won't help ext3 and ext4.)

I don't have the time to look at it now, but I'll put it on my todo
list; or maybe someone with a bit more time can look into how we might
be able to use a similar approach in the generic file system code.
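
As a purely userspace toy (not the kernel code, and the real lock
busting is of course more subtle): if every writer holds one exclusive
lock for the full duration of its I/O, extra threads buy nothing, while
a shared lock for the common non-extending direct IO case lets writers
overlap.  The sketch only models that locking pattern, with sleeps
standing in for device I/O; build with "cc -pthread":

/*
 * Exclusive lock across the whole "I/O" serialises all writers;
 * a shared lock lets them run in parallel.
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define THREADS		8
#define IOS_PER_THREAD	50

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

static void *writer(void *arg)
{
	int exclusive = *(int *)arg;

	for (int i = 0; i < IOS_PER_THREAD; i++) {
		if (exclusive)
			pthread_mutex_lock(&mutex);	/* i_mutex-style */
		else
			pthread_rwlock_rdlock(&rwlock);	/* shared: overlap */
		usleep(1000);	/* stand-in for the actual device I/O */
		if (exclusive)
			pthread_mutex_unlock(&mutex);
		else
			pthread_rwlock_unlock(&rwlock);
	}
	return NULL;
}

static double run(int exclusive)
{
	pthread_t tid[THREADS];
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < THREADS; i++)
		pthread_create(&tid[i], NULL, writer, &exclusive);
	for (int i = 0; i < THREADS; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	printf("exclusive lock: %.2fs\n", run(1));
	printf("shared lock:    %.2fs\n", run(0));
	return 0;
}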

   	       	 	 	     	 	 - Ted

* Re: Filesystem benchmarks on reasonably fast hardware
  2011-07-26 14:57         ` Ted Ts'o
@ 2011-07-27  3:39           ` Yongqiang Yang
  0 siblings, 0 replies; 21+ messages in thread
From: Yongqiang Yang @ 2011-07-27  3:39 UTC (permalink / raw)
  To: Jörn Engel, Ted Ts'o; +Cc: linux-fsdevel, linux-ext4

Hi  Jörn and Ted,

Could anyone send the ext4 lockstat out to the list?

Thank you,
Yongqiang.


On Tue, Jul 26, 2011 at 10:57 PM, Ted Ts'o <tytso@mit.edu> wrote:
> On Mon, Jul 25, 2011 at 08:20:37PM +0200, Jörn Engel wrote:
>> On Mon, 25 July 2011 11:18:25 -0400, Ted Ts'o wrote:
>> >
>> > Did you have a chance to do an ext4 lockstat run?
>
> Hi Jörn,
>
> Thanks for forwarding it to me.  It's the same problem as in XFS, the
> excessive coverage of the i_mutex lock.  In ext4's case, it's in the
> generic generic_file_aio_write() machinery where we need to do the
> lock busting.  (XFS apparently doesn't use the generic routines, so
> the fix that Dave did won't help ext3 and ext4.)
>
> I don't have the time to look at it now, but I'll put it on my todo
> list; or maybe someone with a bit more time can look into how we might
> be able to use a similar approach in the generic file system code.
>
>                                                 - Ted



-- 
Best Wishes
Yongqiang Yang

end of thread, other threads:[~2011-07-27  3:39 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
2011-07-17 16:05 Filesystem benchmarks on reasonably fast hardware Jörn Engel
2011-07-17 23:32 ` Dave Chinner
     [not found]   ` <20110718075339.GB1437@logfs.org>
2011-07-18 10:57     ` Dave Chinner
2011-07-18 11:40       ` Jörn Engel
2011-07-19  2:41         ` Dave Chinner
2011-07-19  7:36           ` Jörn Engel
2011-07-19  9:23             ` srimugunthan dhandapani
2011-07-21 19:05               ` Jörn Engel
2011-07-19 10:15             ` Dave Chinner
2011-07-18 14:34       ` Jörn Engel
     [not found]     ` <20110718103956.GE1437@logfs.org>
2011-07-18 11:10       ` Dave Chinner
2011-07-18 12:07 ` Ted Ts'o
2011-07-18 12:42   ` Jörn Engel
2011-07-25 15:18     ` Ted Ts'o
2011-07-25 18:20       ` Jörn Engel
2011-07-25 21:18         ` Ted Ts'o
2011-07-26 14:57         ` Ted Ts'o
2011-07-27  3:39           ` Yongqiang Yang
2011-07-19 13:19 ` Dave Chinner
2011-07-21 10:42   ` Jörn Engel
2011-07-22 18:51     ` Jörn Engel
