* EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
@ 2010-02-27  0:31 Justin Piszcz
  2010-02-27  0:46 ` Dmitry Monakhov
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Justin Piszcz @ 2010-02-27  0:31 UTC (permalink / raw)
  To: linux-ext4, linux-kernel; +Cc: Alan Piszcz

Hello,

Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
I see about half the performance as XFS for sequential writes.

I have checked the doc and tried several options, a few of which are shown
below (I have also tried the commit/journal_async/etc options but none of 
them get the write speeds anywhere near XFS)?

Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2 
hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.

When it was XFS I used to get 400-600MiB/s for writes for the same RAID 
volume.

How do I 'speed' up ext4?  Is it possible?

raid0_11 disks: (XFS)
# /dev/md0        /r1             xfs     noatime         0       1
p63:/r1# dd if=/dev/zero of=bigfile1 bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.1021 s, 593 MB/s
p63:/r1#

raid0_11 disks: (EXT4)
# /dev/md0        /r1             ext4     noatime         0       1
# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.3741 s, 304 MB/s
p63:/r1#

Other tests (ext4)
p63:~# mount /dev/md0 /r1 -o data=writeback
p63:~# cd /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.8746 s, 269 MB/s
p63:/r1#

p63:~# mount /dev/md0 /r1 -o data=writeback,nobarrier
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 40.0656 s, 268 MB/s

Justin.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27  0:31 EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes? Justin Piszcz
@ 2010-02-27  0:46 ` Dmitry Monakhov
  2010-02-27  1:05     ` Justin Piszcz
  2010-02-27  0:51 ` Eric Sandeen
  2010-03-02  0:08 ` Eric Sandeen
  2 siblings, 1 reply; 24+ messages in thread
From: Dmitry Monakhov @ 2010-02-27  0:46 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-ext4, linux-kernel, Alan Piszcz

Justin Piszcz <jpiszcz@lucidpixels.com> writes:

> Hello,
>
> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
> I see about half the performance as XFS for sequential writes.
>
> I have checked the doc and tried several options, a few of which are shown
> below (I have also tried the commit/journal_async/etc options but none of 
> them get the write speeds anywhere near XFS)?
>
> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2 
> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>
> When it was XFS I used to get 400-600MiB/s for writes for the same RAID 
> volume.
>
> How do I 'speed' up ext4?  Is it possible?
I don't know how to speed ext4 up, but I do know how to slow XFS down :)
It seems you forgot to call fsync at the end of the file write,
so some data may still reside in the memory cache.
Please add "conv=fsync" or "conv=fdatasync" to the dd command
and redo your measurements.
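
For example (just a sketch, reusing the sizes from your test), something like:

  dd if=/dev/zero of=bigfile1 bs=1M count=10240 conv=fsync
  # or flush only the file data, skipping the metadata flush:
  dd if=/dev/zero of=bigfile1 bs=1M count=10240 conv=fdatasync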
>
> raid0_11 disks: (XFS)
> # /dev/md0        /r1             xfs     noatime         0       1
> p63:/r1# dd if=/dev/zero of=bigfile1 bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 18.1021 s, 593 MB/s
> p63:/r1#
>
> raid0_11 disks: (EXT4)
> # /dev/md0        /r1             ext4     noatime         0       1
> # dd if=/dev/zero of=file bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 35.3741 s, 304 MB/s
> p63:/r1#
>
> Other tests (ext4)
> p63:~# mount /dev/md0 /r1 -o data=writeback
> p63:~# cd /r1
> p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 39.8746 s, 269 MB/s
> p63:/r1#
>
> p63:~# mount /dev/md0 /r1 -o data=writeback,nobarrier
> p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 40.0656 s, 268 MB/s
>
> Justin.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27  0:31 EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes? Justin Piszcz
  2010-02-27  0:46 ` Dmitry Monakhov
@ 2010-02-27  0:51 ` Eric Sandeen
  2010-02-27  1:08   ` Justin Piszcz
  2010-03-02  0:08 ` Eric Sandeen
  2 siblings, 1 reply; 24+ messages in thread
From: Eric Sandeen @ 2010-02-27  0:51 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-ext4, linux-kernel, Alan Piszcz

Justin Piszcz wrote:
> Hello,
> 
> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
> I see about half the performance as XFS for sequential writes.
> 
> I have checked the doc and tried several options, a few of which are shown
> below (I have also tried the commit/journal_async/etc options but none
> of them get the write speeds anywhere near XFS)?
> 
> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
> 
> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
> volume.
> 
> How do I 'speed' up ext4?  Is it possible?

Aside from Dmitry's suggestion to time sync as well (although for 10G, you are
likely not leaving much in cache) I'd ask:

What kernel version?  what xfsprogs/e2fsprogs version?

Were the filesystems created to align with raid geometry?

mkfs.xfs has done that forever; mkfs.ext4 only will do so (automatically)
with recent kernel+e2fsprogs.
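
If not, the numbers are easy to supply by hand: stride is the raid chunk size
divided by the fs block size, and stripe-width is stride times the number of
data disks.  For an 11-disk RAID-0, assuming the md default 64KiB chunk and
4KiB blocks, that works out to roughly:

  # stride = 64KiB / 4KiB = 16; stripe-width = 16 * 11 = 176
  mkfs.ext4 -b 4096 -E stride=16,stripe-width=176 /dev/md0

(adjust for your actual chunk size, of course).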

-Eric

> raid0_11 disks: (XFS)
> # /dev/md0        /r1             xfs     noatime         0       1
> p63:/r1# dd if=/dev/zero of=bigfile1 bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 18.1021 s, 593 MB/s
> p63:/r1#
> 
> raid0_11 disks: (EXT4)
> # /dev/md0        /r1             ext4     noatime         0       1
> # dd if=/dev/zero of=file bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 35.3741 s, 304 MB/s
> p63:/r1#
> 
> Other tests (ext4)
> p63:~# mount /dev/md0 /r1 -o data=writeback
> p63:~# cd /r1
> p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 39.8746 s, 269 MB/s
> p63:/r1#
> 
> p63:~# mount /dev/md0 /r1 -o data=writeback,nobarrier
> p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 40.0656 s, 268 MB/s
> 
> Justin.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27  0:46 ` Dmitry Monakhov
@ 2010-02-27  1:05     ` Justin Piszcz
  0 siblings, 0 replies; 24+ messages in thread
From: Justin Piszcz @ 2010-02-27  1:05 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, linux-kernel, xfs, Alan Piszcz



On Sat, 27 Feb 2010, Dmitry Monakhov wrote:

> Justin Piszcz <jpiszcz@lucidpixels.com> writes:
>
>> Hello,
>>
>> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
>> I see about half the performance as XFS for sequential writes.
>>
>> I have checked the doc and tried several options, a few of which are shown
>> below (I have also tried the commit/journal_async/etc options but none of
>> them get the write speeds anywhere near XFS)?
>>
>> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
>> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>>
>> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
>> volume.
>>
>> How do I 'speed' up ext4?  Is it possible?
> I don't know how to speed ext4 up, but I do know how to slow XFS down :)
> It seems you forgot to call fsync at the end of the file write,
> so some data may still reside in the memory cache.
> Please add "conv=fsync" or "conv=fdatasync" to the dd command
> and redo your measurements.

Hi,

First, with a sync included in the total time (XFS is still ~2x as fast):

EXT3:
p63:~# mount /dev/md0 -o nobarrier,data=writeback /r1
p63:~# cd /r1
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M count=10240; sync'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.4163 s, 303 MB/s
0.02user 19.85system 0:36.97elapsed 53%CPU (0avgtext+0avgdata 7296maxresident)k
0inputs+0outputs (5major+1145minor)pagefaults 0swaps

XFS:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M count=10240; sync'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.08 s, 594 MB/s
0.03user 16.15system 0:18.67elapsed 86%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (5major+1147minor)pagefaults 0swaps
p63:/r1#

Per your request: conv=fsync & conv=fdatasync


XFS:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fsync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.2142 s, 590 MB/s
0.03user 16.05system 0:18.21elapsed 88%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (0major+832minor)pagefaults 0swaps
p63:/r1#

EXT3:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fdatasync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.5562 s, 271 MB/s

XFS:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fdatasync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.513 s, 580 MB/s
0.03user 16.25system 0:18.51elapsed 87%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (5major+828minor)pagefaults 0swaps
p63:/r1#

p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fsync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.7859 s, 270 MB/s
0.02user 24.20system 0:39.79elapsed 60%CPU (0avgtext+0avgdata 7328maxresident)k
0inputs+0outputs (5major+829minor)pagefaults 0swaps
p63:/r1#

XFS is still 2x as fast.
Is there some other option I am missing here, or is this correct?

Justin.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27  0:51 ` Eric Sandeen
@ 2010-02-27  1:08   ` Justin Piszcz
  2010-02-27  1:12     ` Eric Sandeen
  0 siblings, 1 reply; 24+ messages in thread
From: Justin Piszcz @ 2010-02-27  1:08 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-ext4, linux-kernel, Alan Piszcz



On Fri, 26 Feb 2010, Eric Sandeen wrote:

> Justin Piszcz wrote:
>> Hello,
>>
>> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
>> I see about half the performance as XFS for sequential writes.
>>
>> I have checked the doc and tried several options, a few of which are shown
>> below (I have also tried the commit/journal_async/etc options but none
>> of them get the write speeds anywhere near XFS)?
>>
>> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
>> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>>
>> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
>> volume.
>>
>> How do I 'speed' up ext4?  Is it possible?
>
> Aside from Dmitry's suggestion to time sync as well (although for 10G, you are
> likely not leaving much in cache) I'd ask:
>
> What kernel version?  what xfsprogs/e2fsprogs version?
2.6.33/x86_64

ii  xfsprogs                                3.1.1                    Utilities for managing the XFS filesystem
ii  e2fsprogs                               1.41.10-1                ext2/ext3/ext4 file system utilities

>
> Were the filesystems created to align with raid geometry?
Only default options were used except the mount options.  If that is the
culprit, I have some more testing to do, thanks, will look into it.

>
> mkfs.xfs has done that forever; mkfs.ext4 only will do so (automatically)
> with recent kernel+e2fsprogs.
How recent?



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27  1:08   ` Justin Piszcz
@ 2010-02-27  1:12     ` Eric Sandeen
  2010-02-27  1:28       ` Eric Sandeen
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Sandeen @ 2010-02-27  1:12 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-ext4, linux-kernel, Alan Piszcz

Justin Piszcz wrote:
...

>> Were the filesystems created to align with raid geometry?
> Only default options were used except the mount options.  If that is the
> culprit, I have some more testing to do, thanks, will look into it.
> 
>>
>> mkfs.xfs has done that forever; mkfs.ext4 only will do so (automatically)
>> with recent kernel+e2fsprogs.
> How recent?

You're recent enough.  :)

mkfs.ext4 output should include the stripe info if it was found.

        printf(_("Block size=%u (log=%u)\n"), fs->blocksize,
                s->s_log_block_size);
        printf(_("Fragment size=%u (log=%u)\n"), fs->fragsize,
                s->s_log_frag_size);
        printf(_("Stride=%u blocks, Stripe width=%u blocks\n"),
               s->s_raid_stride, s->s_raid_stripe_width);
        printf(_("%u inodes, %llu blocks\n"), s->s_inodes_count,
               ext2fs_blocks_count(s));

etc.
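
You can also check what an existing filesystem recorded; something like:

  dumpe2fs -h /dev/md0 | grep -i raid

should show the "RAID stride" / "RAID stripe width" fields if they were set.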

-Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27  1:12     ` Eric Sandeen
@ 2010-02-27  1:28       ` Eric Sandeen
  2010-02-27 10:14         ` Justin Piszcz
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Sandeen @ 2010-02-27  1:28 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-ext4, linux-kernel, Alan Piszcz

Eric Sandeen wrote:
> Justin Piszcz wrote:
> ...
> 
>>> Were the filesystems created to align with raid geometry?
>> Only default options were used except the mount options.  If that is the
>> culprit, I have some more testing to do, thanks, will look into it.
>>
>>> mkfs.xfs has done that forever; mkfs.ext4 only will do so (automatically)
>>> with recent kernel+e2fsprogs.
>> How recent?
> 
> You're recent enough.  :)

Oh, you need a very recent util-linux-ng as well, and to use its libblkid,
build e2fsprogs with:

[e2fsprogs] # ./configure --disable-libblkid

Otherwise you can just feed mkfs.ext4 the stride & stripe-width manually.
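
The values to feed it can be read straight off the array, e.g. something like:

  mdadm --detail /dev/md0 | grep -E 'Chunk Size|Raid Devices'

and the stride / stripe-width follow from those two numbers.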

-Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27  1:28       ` Eric Sandeen
@ 2010-02-27 10:14         ` Justin Piszcz
  2010-02-27 10:51           ` Justin Piszcz
  0 siblings, 1 reply; 24+ messages in thread
From: Justin Piszcz @ 2010-02-27 10:14 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-ext4, linux-kernel, Alan Piszcz



On Fri, 26 Feb 2010, Eric Sandeen wrote:

> Eric Sandeen wrote:
>
> Oh, you need very recent util-linux-ng as well, and use libblkid from there
> with:
>
> [e2fsprogs] # ./configure --disable-libblkid
>
> Otherwise you can just feed mkfs.ext4 stripe & stride manually.
>
> -Eric
>

Hi,

Even when set, there is still poor performance:

Using http://busybox.net/~aldot/mkfs_stride.html with:
Raid Level: 0
Number of Physical Disks: 11
RAID chunk size (in KiB): 1024
Filesystem block size (in KiB): 4
which gives: mkfs.ext4 -b 4096 -E stride=256,stripe-width=2816

p63:~# /usr/bin/time mkfs.ext4 -b 4096 -E stride=256,stripe-width=2816 /dev/md0
mke2fs 1.41.10 (10-Feb-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=256 blocks, Stripe width=2816 blocks
335765504 inodes, 1343055824 blocks
67152791 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
40987 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
         102400000, 214990848, 512000000, 550731776, 644972544

Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 38 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
p63:~#

p63:~# mount /dev/md0 /r1 -o nobarrier,data=writeback
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.3674 s, 273 MB/s
p63:/r1#

Still very slow?

Let's try with some optimizations:
p63:/r1#  mount /dev/md0 /r1 -o noatime,barrier=0,data=writeback,nobh,commit=100,nouser_xattr,nodelalloc,max_batch_time=0^C

Still not anywhere near 500-600MiB/s of XFS:
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 30.4824 s, 352 MB/s
p63:/r1#

Am I doing something wrong, or is there a flag I am missing that will speed it
up?  Or is this the expected performance for sequential writes on EXT4?
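
One quick sanity check is that the options actually took effect, e.g.:

  grep md0 /proc/mounts

to see which ext4 mount options the kernel is really using.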

Justin.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27 10:14         ` Justin Piszcz
@ 2010-02-27 10:51           ` Justin Piszcz
  2010-02-27 11:09             ` Justin Piszcz
  0 siblings, 1 reply; 24+ messages in thread
From: Justin Piszcz @ 2010-02-27 10:51 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-ext4, linux-kernel, Alan Piszcz



On Sat, 27 Feb 2010, Justin Piszcz wrote:

> 
> 
> On Fri, 26 Feb 2010, Eric Sandeen wrote:
> 
> > Eric Sandeen wrote:
> >
> > Oh, you need very recent util-linux-ng as well, and use libblkid from there
> > with:
> >
> > [e2fsprogs] # ./configure --disable-libblkid
> >
> > Otherwise you can just feed mkfs.ext4 stripe & stride manually.
> >
> > -Eric
> >


I also tried the default chunk size (64KiB) in case ext4 had a problem with
chunk sizes > 64KiB; the results were the same for ext4.  I also tried ext2 &
ext3 as well, just to see what their performance would be:

p63:~# mkfs.ext2 -b 4096 -E stride=16,stripe-width=176 /dev/md0
p63:~# mount /dev/md0 /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10737418240 bytes (11 GB) copied, 19.9434 s, 538 MB/s
p63:/r1#

p63:~# mkfs.ext3 -b 4096 -E stride=16,stripe-width=176 /dev/md0
p63:~# mount /dev/md0 /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10737418240 bytes (11 GB) copied, 31.0195 s, 346 MB/s

p63:~# mkfs.ext4 -b 4096 -E stride=16,stripe-width=176 /dev/md0
p63:~# mount /dev/md0 /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10737418240 bytes (11 GB) copied, 35.3866 s, 303 MB/s

And, for comparison, XFS:
p63:~# mkfs.xfs -f /dev/md0 > /dev/null 2>&1
p63:~# mount /dev/md0 /r1
p63:~# cd /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.1527 s, 592 MB/s
p63:/r1#


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27 10:51           ` Justin Piszcz
@ 2010-02-27 11:09             ` Justin Piszcz
  2010-02-27 11:36               ` Justin Piszcz
  0 siblings, 1 reply; 24+ messages in thread
From: Justin Piszcz @ 2010-02-27 11:09 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-ext4, linux-kernel, Alan Piszcz



On Sat, 27 Feb 2010, Justin Piszcz wrote:

> 
> 
> On Sat, 27 Feb 2010, Justin Piszcz wrote:
> 
> > 
> > 
> > On Fri, 26 Feb 2010, Eric Sandeen wrote:
> >

Hi,

I have found the same results on 2 different systems:

It seems to peak at ~350MiB/s on mdadm raid, whether RAID-5 or RAID-0
(two separate machines).

The only option I found that takes it from:
10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s
to
10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s
is the -o nodelalloc option.

The question is: why is it not breaking the 350MiB/s barrier?
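
One thing I may try next (just an idea at this point) is taking the page cache
out of the picture with O_DIRECT, to see whether the limit is in the writeback
path or in the allocator/array itself, e.g. something like:

  dd if=/dev/zero of=file bs=1M count=10240 oflag=direct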

Justin.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27 11:09             ` Justin Piszcz
@ 2010-02-27 11:36               ` Justin Piszcz
  2010-02-28  5:42                 ` tytso
  2010-02-28 23:50                 ` Dave Chinner
  0 siblings, 2 replies; 24+ messages in thread
From: Justin Piszcz @ 2010-02-27 11:36 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-ext4, linux-kernel, Alan Piszcz



On Sat, 27 Feb 2010, Justin Piszcz wrote:

> 
> 
> On Sat, 27 Feb 2010, Justin Piszcz wrote:
> 
> > 
> > 
> > On Sat, 27 Feb 2010, Justin Piszcz wrote:
> > 
> > > 
> > > 
> > > On Fri, 26 Feb 2010, Eric Sandeen wrote:
> > >
> 
> Hi,
> 
> I have found the same results on 2 different systems:
> 
> It seems to peak at ~350MiB/s performance on mdadm raid, whether
> a RAID-5 or RAID-0 (two separate machines):
> 
> The only option I found that allows it to go from:
> 10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s
> to
> 10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s
> 
> Is the -o nodelalloc option.
> 
> How come it is not breaking the 350MiB/s barrier is the question?
> 
> Justin.
> 
>

Large sequential I/O aside, ext4 seems to be MUCH faster than XFS when
working with many small files.

EXT4

p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
0.18user 2.43system 0:02.86elapsed 91%CPU (0avgtext+0avgdata 5216maxresident)k
0inputs+0outputs (0major+971minor)pagefaults 0swaps
linux-2.6.33  linux-2.6.33.tar
p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
0.02user 0.98system 0:01.03elapsed 97%CPU (0avgtext+0avgdata 5216maxresident)k
0inputs+0outputs (0major+865minor)pagefaults 0swaps

XFS

p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
0.20user 2.62system 1:03.90elapsed 4%CPU (0avgtext+0avgdata 5200maxresident)k
0inputs+0outputs (0major+970minor)pagefaults 0swaps
p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
0.03user 2.02system 0:29.04elapsed 7%CPU (0avgtext+0avgdata 5200maxresident)k
0inputs+0outputs (0major+864minor)pagefaults 0swaps

So I guess that's the tradeoff: for massive sequential I/O you should use XFS;
otherwise, use EXT4?

I would still like to know, however, why 350MiB/s seems to be the maximum
performance I can get from two different md raids (which easily do 600MiB/s
with XFS).

Is this a performance issue within ext4 and md-raid?
The problem does not exist with xfs and md-raid.

Justin.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27  1:05     ` Justin Piszcz
  (?)
@ 2010-02-28  0:56     ` Asdo
  2010-02-28  9:59       ` Justin Piszcz
  -1 siblings, 1 reply; 24+ messages in thread
From: Asdo @ 2010-02-28  0:56 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel

Justin Piszcz wrote:
>
>
> On Sat, 27 Feb 2010, Dmitry Monakhov wrote:
>
>> Justin Piszcz <jpiszcz@lucidpixels.com> writes:
>>
>>> Hello,
>>>
>>> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
>>> I see about half the performance as XFS for sequential writes.
>>>
>>> I have checked the doc and tried several options, a few of which are 
>>> shown
>>> below (I have also tried the commit/journal_async/etc options but 
>>> none of
>>> them get the write speeds anywhere near XFS)?
>>>
>>> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
>>> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>>>
>>> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
>>> volume.
>>>
>>> How do I 'speed' up ext4?  Is it possible?
Hi Justin,
Sorry for being OT in my reply (I can't answer your question, unfortunately).
Can you really get 550MiB/sec through a 10-gigabit ethernet connection?
I didn't think that was possible.  Just a few years ago it seemed there were
problems getting a full gigabit out of 1-gigabit ethernet adapters...
Is it running some kind of offloading like TOE, or RDMA, or other magic
things?  (Maybe by default... you can check something with ethtool
--show-offload eth0, but TOE isn't listed there.)
Or have computers really become so fast that I missed something?
Sorry for the stupid question.
(Please note: I removed most CC recipients because I went OT.)

Thank you

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27 11:36               ` Justin Piszcz
@ 2010-02-28  5:42                 ` tytso
  2010-02-28 14:55                   ` Justin Piszcz
  2010-02-28 23:50                 ` Dave Chinner
  1 sibling, 1 reply; 24+ messages in thread
From: tytso @ 2010-02-28  5:42 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Eric Sandeen, linux-ext4, linux-kernel, Alan Piszcz

On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
> 
> I still would like to know however, why 350MiB/s seems to be the maximum
> performance I can get from two different md raids (that easily do 600MiB/s
> with XFS).

Can you run "filefrag -v <filename>" on the large file you created
using dd?  Part of the problem may be the block allocator simply not
being well optimized super large writes.  To be honest, that's not
something we've tried (at all) to optimize, mainly because for most
users of ext4 they're more interested in much more reasonable sized
files, and we only have so many hours in a day to hack on ext4.  :-)
XFS in contrast has in the past had plenty of paying customers
interested in writing really large scientific data sets, so this is
something Irix *has* spent time optimizing.

As far as I know none of the ext4 developers' day jobs are currently
focused on really large files using ext4.  Some of us do use ext4 to
support really large files, but it's via some kind of cluster or
parallel file system layered on top of ext4 (i.e., Sun/Clusterfs
Lustre File Systems, or Google's GFS) --- and so what gets actually
stored in ext4 isn't a single 10-20 gigabyte file.

I'm saying this not as an excuse; but it's an explanation for why no
one has really noticed this performance problem until you brought it
up.  I'd like to see ext4 be a good general purpose file system, which
includes handling the really big files stored in a single system.  But
it's just not something we've tried optimizing yet.

So if you can gather some data, such as the filefrag information, that
would be a great first step.  Something else that would be useful is
gathering blktrace information, so we can see how we are scheduling
the writes and whether we have something bad going on there.  I
wouldn't be surprised if there is some stupidity going on in the
generic FS/MM writeback code which is throttling us, and which XFS has
worked around.  Ext4 has worked around some writeback brain-damage
already, but I've been focused on much smaller files (files in the
tens or hundreds megabytes) since that's what I tend to use much more
frequently.
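
(If you haven't used blktrace before, a simple local capture while the dd is
running is enough to start with; roughly, along the lines of:

  blktrace -d /dev/md0 -o md0 -w 60      # trace /dev/md0 for 60 seconds
  blkparse -i md0 -d md0.bin > md0.txt   # human-readable events + binary dump
  btt -i md0.bin > md0.btt               # summary statistics

Even the raw blkparse output would be useful.)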

It's great to see that you're really interested in this; if you're
willing to do some investigative work, hopefully it's something we can
address.

Best Regards,

						- Ted

P.S.  I'm a bit unclear regarding your comment about "-o nodelalloc"
in one of your earlier threads.  Does using nodelalloc actually speed
things up?  There were a bunch of numbers being thrown around, and in
some configurations I thought you were getting around 300 MB/s without
using nodelalloc?  Or am I misunderstanding your numbers and what
configurations you used with each test run?

If nodelalloc is actually speeding things up, then we almost certainly
have some kind of writeback problem.  So filefrag and blktrace are
definitely the tools we need to look at to understand what is going
on.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-28  0:56     ` Asdo
@ 2010-02-28  9:59       ` Justin Piszcz
  0 siblings, 0 replies; 24+ messages in thread
From: Justin Piszcz @ 2010-02-28  9:59 UTC (permalink / raw)
  To: Asdo; +Cc: linux-kernel



On Sun, 28 Feb 2010, Asdo wrote:

> Justin Piszcz wrote:
>> 
>> 
>> On Sat, 27 Feb 2010, Dmitry Monakhov wrote:
>> 
>>> Justin Piszcz <jpiszcz@lucidpixels.com> writes:
>>> 
>>>> Hello,
>>>> 
>>>> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
>>>> I see about half the performance as XFS for sequential writes.
>>>> 
>>>> I have checked the doc and tried several options, a few of which are 
>>>> shown
>>>> below (I have also tried the commit/journal_async/etc options but none of
>>>> them get the write speeds anywhere near XFS)?
>>>> 
>>>> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
>>>> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>>>> 
>>>> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
>>>> volume.
>>>> 
>>>> How do I 'speed' up ext4?  Is it possible?
> Hi Justin
> sorry for being OT in my reply (I can't answer your question unfortunately)
> You can really get 550MiB/sec through a 10gigabit ethernet connection?
Yes, I am capped by the disk I/O; the network card itself does ~1 gigabyte
per second over iperf.  If I had two raid systems that did >= 1GByte/sec
read+write AND enough PCI-e bandwidth, it would be plausible to see large
files transferring at 10Gbps speeds.

> I didn't think it was possible. Just a few years ago it seems to me there 
> were problems in obtaining a full gigabit out of 1Gigabit ethernet 
> adapters...
I have been running gigabit for a while now and have been able to saturate
it between Linux hosts for some time.  If you are referring to Windows and
the transfer rates via Samba, their networking stack did not get 'fixed'
until Windows 7; before that it seemed to be 'capped' at 40-60MiB/s,
regardless of the HW.  With 7, you always get ~100MiB/s if your HW is fast
enough.  A single Intel X25-E SSD can read > 200MiB/s, as can many of the
newer SSDs being released (the Micron 6Gbps), pushing 300MiB/s.  As SSDs
become more mainstream, gigabit will become more and more of a bottleneck.

> Is it running some kind of offloading like TOE, or RDMA or other magic 
> things? (maybe by default... you can check something with ethtool 
Yes, check the features here (page 2/4), half way down:
http://www.intel.com/Assets/PDF/prodbrief/318349.pdf

> --show-offload eth0, but TOE isn't there)
> Or really computers became so fast and I missed something...?
PCI-express (for the bandwidth) (not PCI-X), jumbo frames (mtu=9000) 
and the 2.6 kernel.
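
To check what a particular card offloads, and to turn on jumbo frames (assuming
the interface is eth0 and the switch supports it), roughly:

  ethtool -k eth0            # same as --show-offload
  ip link set eth0 mtu 9000  # enable jumbo frames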

> Sorry for the stupid question
> (pls note: I removed most CC recipients because I went OT)
>
> Thank you
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-28  5:42                 ` tytso
@ 2010-02-28 14:55                   ` Justin Piszcz
  2010-03-01  8:39                     ` Andreas Dilger
  2010-03-01 16:15                     ` Eric Sandeen
  0 siblings, 2 replies; 24+ messages in thread
From: Justin Piszcz @ 2010-02-28 14:55 UTC (permalink / raw)
  To: tytso; +Cc: Eric Sandeen, linux-ext4, linux-kernel, Alan Piszcz



On Sun, 28 Feb 2010, tytso@mit.edu wrote:

> On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
>>
>> I still would like to know however, why 350MiB/s seems to be the maximum
>> performance I can get from two different md raids (that easily do 600MiB/s
>> with XFS).

> Can you run "filefrag -v <filename>" on the large file you created
> using dd?  Part of the problem may be the block allocator simply not
> being well optimized super large writes.  To be honest, that's not
> something we've tried (at all) to optimize, mainly because for most
> users of ext4 they're more interested in much more reasonable sized
> files, and we only have so many hours in a day to hack on ext4.  :-)
> XFS in contrast has in the past had plenty of paying customers
> interested in writing really large scientific data sets, so this is
> something Irix *has* spent time optimizing.
Yes, this is shown at the bottom of the e-mail both with -o data=ordered
and data=writeback.

[ .. ]

> So if you can gather some data, such as the filefrag information, that
> would be a great first step.  Something else that would be useful is
> gathering blktrace information, so we can see how we are scheduling
> the writes and whether we have something bad going on there.  I
> wouldn't be surprised if there is some stupidity going on in the
> generic FS/MM writeback code which is throttling us, and which XFS has
> worked around.  Ext4 has worked around some writeback brain-damage
> already, but I've been focused on much smaller files (files in the
> tens or hundreds megabytes) since that's what I tend to use much more
> frequently.
>
> It's great to see that you're really interested in this; if you're
> willing to do some investigative work, hopefully it's something we can
> address.

[ .. ]

> P.S.  I'm a bit unclear regarding your comment about "-o nodelalloc"
> in one of your earlier threads.  Does using nodelalloc actually speed
> things up?  There were a bunch of numbers being thrown around, and in
> some configurations I thought you were getting around 300 MB/s without
> using nodelalloc?  Or am I misunderstanding your numbers and what
> configurations you used with each test run?
This is more dramatic on the software raid (mdadm) RAID-5 configuration. 
Without -o nodelalloc, I see roughly 200MiB/s.  With -o nodelalloc, I hit 
the same barrier as the RAID-0, 350MiB/s, but its effect on RAID-0 is less 
dramatic.  The full tests and output appear at the bottom of this e-mail; 
however, for brevity, the example below shows 55MiB/s and 132MiB/s
performance increases with RAID-0 and RAID-5 respectively:

For the RAID-0:

-o data=writeback,nobarrier:
10737418240 bytes (11 GB) copied, 34.755 s, 309 MB/s
-o data=writeback,nobarrier,nodelalloc:
10737418240 bytes (11 GB) copied, 29.5299 s, 364 MB/s
An increase of 55MiB/s.

For the RAID-5 (from earlier testing):

-o data=writeback,nobarrier:
10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s
-o data=writeback,nobarrier,nodelalloc:
10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s
An increase of 132MiB/s.

>
> If nodelalloc is actually speeding things up, then we almost certainly
> have some kind of writeback problem.  So filefrag and blktrace are
> definitely the tools we need to look at to understand what is going
> on.
>

=== CREATE RAID-0 WITH 11 DISKS

p63:~# mdadm --create -e 0.90 /dev/md0 /dev/sd[b-l]1 --level=0 -n 11 -c 64
mdadm: /dev/sdb1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdc1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdd1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sde1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdf1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdg1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdh1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdi1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdj1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdk1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdl1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
Continue creating array? y
mdadm: array /dev/md0 started.
p63:~#

=== SHOW MDADM RAID-0

p63:~# mdadm -D /dev/md0
/dev/md0:
         Version : 0.90
   Creation Time : Sun Feb 28 06:31:41 2010
      Raid Level : raid0
      Array Size : 5372223296 (5123.35 GiB 5501.16 GB)
    Raid Devices : 11
   Total Devices : 11
Preferred Minor : 0
     Persistence : Superblock is persistent

     Update Time : Sun Feb 28 06:31:41 2010
           State : clean
  Active Devices : 11
Working Devices : 11
  Failed Devices : 0
   Spare Devices : 0

      Chunk Size : 64K

            UUID : 077d4d5c:5acbcb29:26614430:c3345183 (local to host p63)
          Events : 0.1

     Number   Major   Minor   RaidDevice State
        0       8       17        0      active sync   /dev/sdb1
        1       8       33        1      active sync   /dev/sdc1
        2       8       49        2      active sync   /dev/sdd1
        3       8       65        3      active sync   /dev/sde1
        4       8       81        4      active sync   /dev/sdf1
        5       8       97        5      active sync   /dev/sdg1
        6       8      113        6      active sync   /dev/sdh1
        7       8      129        7      active sync   /dev/sdi1
        8       8      145        8      active sync   /dev/sdj1
        9       8      161        9      active sync   /dev/sdk1
       10       8      177       10      active sync   /dev/sdl1
p63:~#

=== KERNEL CONFIGURATION BASELINE

The following kernel configuration was used:
http://home.comcast.net/~jpiszcz/20100228/config-2.6.33-baseline.txt

=== ESTABLISH CONTROL / BASELINE

p63:~# mkfs.xfs /dev/md0 -f
meta-data=/dev/md0               isize=256    agcount=32, agsize=41970496 blks
          =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1343055824, imaxpct=5
          =                       sunit=16     swidth=176 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
p63:~# mount /dev/md0 /r1 -o nobarrier
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 17.9816 s, 597 MB/s
0.03user 16.10system 0:17.99elapsed 89%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (1major+495minor)pagefaults 0swaps
p63:/r1#

p63:/r1# xfs_bmap -v /r1/bigfile 
/r1/bigfile:
  EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET           TOTAL FLAGS
    0: [0..20971519]:   671528064..692499583  2 (128..20971647)  20971520 00011
p63:/r1#

=== CREATE EXT4 FILESYSTEM ON ARRAY (note the stripe/width appears to be
                                      irrelevant to the speed problem, as
                                      the cap is '350MiB/s' whether it is
                                      aligned or not, see the following URL
                                      for those tests)
                                      http://lkml.org/lkml/2010/2/27/77

                                      NOTE: It compares ext2 vs. ext3 vs. ext4
                                            vs. XFS.

                                      NOTE: nobarrier does not seem to be a
                                            factor either, but I will include
                                            it below to ensure it is not
                                            somehow impacting the tests
                                            performed.

p63:~# /usr/bin/time mkfs.ext4 /dev/md0
mke2fs 1.41.10 (10-Feb-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
335765504 inodes, 1343055824 blocks
67152791 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
40987 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
         102400000, 214990848, 512000000, 550731776, 644972544

Writing inode tables: done 
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
6.50user 83.89system 2:01.86elapsed 74%CPU (0avgtext+0avgdata 829552maxresident)k
0inputs+0outputs (5major+51889minor)pagefaults 0swaps
p63:~#

=== MOUNT FILESYSTEM WITH NOBARRIER, ORDERED (DEFAULT) & RUN TEST

p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.2676 s, 304 MB/s
0.02user 19.40system 0:35.29elapsed 55%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (3major+493minor)pagefaults 0swaps
p63:/r1#

=== SHOW FILEFRAG OUTPUT (NOBARRIER,ORDERED)

p63:/r1# filefrag -v /r1/bigfile 
Filesystem type is: ef53
File size of /r1/bigfile is 10737418240 (2621440 blocks, blocksize 4096)
  ext logical  physical  expected length flags
    0       0     34816            32768
    1   32768     67584            30720
    2   63488    100352     98303  32768
    3   96256    133120            30720
    4  126976    165888    163839  32768
    5  159744    198656            30720
    6  190464    231424    229375  32768
    7  223232    264192            30720
    8  253952    296960    294911  32768
    9  286720    329728            32768
   10  319488    362496            32768
   11  352256    395264            32768
   12  385024    428032            32768
   13  417792    460800            32768
   14  450560    493568            30720
   15  481280    557056    524287  32768
   16  514048    589824            32768
   17  546816    622592            32768
   18  579584    655360            32768
   19  612352    688128            32768
   20  645120    720896            32768
   21  677888    753664            32768
   22  710656    786432            32768
   23  743424    821248    819199  32768
   24  776192    854016            30720
   25  806912    886784    884735  32768
   26  839680    919552            32768
   27  872448    952320            32768
   28  905216    985088            32768
   29  937984   1017856            30720
   30  968704   1081344   1048575  32768
   31 1001472   1114112            32768
   32 1034240   1146880            32768
   33 1067008   1179648            32768
   34 1099776   1212416            32768
   35 1132544   1245184            32768
   36 1165312   1277952            32768
   37 1198080   1310720            32768
   38 1230848   1343488            32768
   39 1263616   1376256            32768
   40 1296384   1409024            32768
   41 1329152   1441792            32768
   42 1361920   1474560            32768
   43 1394688   1507328            32768
   44 1427456   1540096            32768
   45 1460224   1607680   1572863  32768
   46 1492992   1640448            32768
   47 1525760   1673216            32768
   48 1558528   1705984            32768
   49 1591296   1738752            32768
   50 1624064   1771520            32768
   51 1656832   1804288            32768
   52 1689600   1837056            32768
   53 1722368   1869824            32768
   54 1755136   1902592            32768
   55 1787904   1935360            32768
   56 1820672   1968128            32768
   57 1853440   2000896            32768
   58 1886208   2033664            32768
   59 1918976   2066432            30720
   60 1949696   2129920   2097151  32768
   61 1982464   2162688            32768
   62 2015232   2195456            32768
   63 2048000   2228224            32768
   64 2080768   2260992            32768
   65 2113536   2293760            32768
   66 2146304   2326528            32768
   67 2179072   2359296            32768
   68 2211840   2392064            32768
   69 2244608   2424832            32768
   70 2277376   2457600            32768
   71 2310144   2490368            32768
   72 2342912   2523136            32768
   73 2375680   2555904            32768
   74 2408448   2588672            32768
   75 2441216   2656256   2621439  32768
   76 2473984   2689024            32768
   77 2506752   2721792            32768
   78 2539520   2754560            32768
   79 2572288   2787328            18432
   80 2590720   2818048   2805759  30720 eof
/r1/bigfile: 13 extents found
p63:/r1#

=== MOUNT FILESYSTEM WITH NOBARRIER, WRITEBACK & RUN TEST

p63:/# mount /dev/md0 -o data=writeback,nobarrier /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 34.755 s, 309 MB/s
0.02user 19.38system 0:34.78elapsed 55%CPU (0avgtext+0avgdata 7280maxresident)k
0inputs+0outputs (3major+491minor)pagefaults 0swaps
p63:/r1#

=== SHOW FILEFRAG OUTPUT (NOBARRIER,WRITEBACK)

p63:/r1# filefrag -v /r1/bigfile 
Filesystem type is: ef53
File size of /r1/bigfile is 10737418240 (2621440 blocks, blocksize 4096)
  ext logical  physical  expected length flags
    0       0     34816            32768
    1   32768     67584            30720
    2   63488    100352     98303  32768
    3   96256    133120            30720
    4  126976    165888    163839  32768
    5  159744    198656            30720
    6  190464    231424    229375  32768
    7  223232    264192            30720
    8  253952    296960    294911  32768
    9  286720    329728            32768
   10  319488    362496            32768
   11  352256    395264            32768
   12  385024    428032            32768
   13  417792    460800            32768
   14  450560    493568            30720
   15  481280    557056    524287  32768
   16  514048    589824            32768
   17  546816    622592            32768
   18  579584    655360            32768
   19  612352    688128            32768
   20  645120    720896            32768
   21  677888    753664            32768
   22  710656    786432            32768
   23  743424    821248    819199  32768
   24  776192    854016            30720
   25  806912    886784    884735  32768
   26  839680    919552            32768
   27  872448    952320            32768
   28  905216    985088            32768
   29  937984   1017856            30720
   30  968704   1081344   1048575  32768
   31 1001472   1114112            32768
   32 1034240   1146880            32768
   33 1067008   1179648            32768
   34 1099776   1212416            32768
   35 1132544   1245184            32768
   36 1165312   1277952            32768
   37 1198080   1310720            32768
   38 1230848   1343488            32768
   39 1263616   1376256            32768
   40 1296384   1409024            32768
   41 1329152   1441792            32768
   42 1361920   1474560            32768
   43 1394688   1507328            32768
   44 1427456   1540096            32768
   45 1460224   1607680   1572863  32768
   46 1492992   1640448            32768
   47 1525760   1673216            32768
   48 1558528   1705984            32768
   49 1591296   1738752            32768
   50 1624064   1771520            32768
   51 1656832   1804288            32768
   52 1689600   1837056            32768
   53 1722368   1869824            32768
   54 1755136   1902592            32768
   55 1787904   1935360            32768
   56 1820672   1968128            32768
   57 1853440   2000896            32768
   58 1886208   2033664            32768
   59 1918976   2066432            30720
   60 1949696   2129920   2097151  32768
   61 1982464   2162688            32768
   62 2015232   2195456            32768
   63 2048000   2228224            32768
   64 2080768   2260992            32768
   65 2113536   2293760            32768
   66 2146304   2326528            32768
   67 2179072   2359296            32768
   68 2211840   2392064            32768
   69 2244608   2424832            32768
   70 2277376   2457600            32768
   71 2310144   2490368            32768
   72 2342912   2523136            32768
   73 2375680   2555904            32768
   74 2408448   2588672            32768
   75 2441216   2656256   2621439  32768
   76 2473984   2689024            32768
   77 2506752   2721792            32768
   78 2539520   2754560            16384 
/r1/bigfile: 12 extents found
p63:/r1#

=== USE OF -o nodelalloc WITH SOFTWARE RAID-0 (SPEED IMPROVEMENT)

p63:/r1# mount /dev/md0 -o data=writeback,nobarrier,nodelalloc /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 29.5299 s, 364 MB/s
0.02user 28.95system 0:29.56elapsed 98%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (3major+493minor)pagefaults 0swaps
p63:/r1#

While it does help, I have not been able to get > 400MiB/s; it stops at
roughly 350-360MiB/s.

=== FIRST ATTEMPT AT USING BLKTRACE

Following these docs:
http://git.kernel.org/?p=linux/kernel/git/axboe/blktrace.git;a=blob;f=README
http://github.com/znmeb/linux_perf_viz/raw/master/blktrace-howto/blktrace-howto.pdf
http://pdfedit.petricek.net/bt/file_download.php?file_id=17&type=bug
http://www.cse.unsw.edu.au/~aaronc/iosched/doc/blktrace.html

Options required in the kernel:

Kernel hacking:
   | |    [*] Debug Filesystem                                             | |

Then the BLK_IO_TRACE (it has moved from where the old docs say to go)
Kernel Hacking:
   | |    [ ] Tracers  --->                                                | |
   | |    [*]   Support for tracing block IO actions                       | |

Compile new kernel, reboot.

New kernel configuration used (only enabled the options shown above)
http://home.comcast.net/~jpiszcz/20100228/config-2.6.33-blktrace.txt
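
(A quick way to confirm both options made it into the running kernel, assuming
the config was installed under /boot:

  grep -E 'CONFIG_DEBUG_FS|CONFIG_BLK_DEV_IO_TRACE' /boot/config-$(uname -r)

both should be '=y'.)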

Next step, create a fresh filesystem for the trace event:

p63:~# /usr/bin/time mkfs.ext4 /dev/md0
< .. >
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

Reboot to new kernel.

Per:
http://pdfedit.petricek.net/bt/file_download.php?file_id=17&type=bug

Mount the debug filesystem / make sure it is mounted:

p63:~# mount -t debugfs debugfs /sys/kernel/debug
mount: debugfs already mounted or /sys/kernel/debug busy
mount: according to mtab, debugfs is already mounted on /sys/kernel/debug
p63:~#

Then follow instructions on page 14 from:
http://github.com/znmeb/linux_perf_viz/raw/master/blktrace-howto/blktrace-howto.pdf

p63:/dev/shm/server# blktrace -l
server: waiting for connections...
server: connection from 192.168.168.113

p63:/dev/shm/client# blktrace -h 192.168.168.113 /dev/md0
blktrace: connecting to 192.168.168.113
blktrace: connected!

Mount filesystem with -o data=writeback,nobarrier, run test blktrace1.

p63:~# mount -o data=writeback,nobarrier /dev/md0 /r1
p63:~# cd /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.6317 s, 301 MB/s
0.03user 19.41system 0:35.67elapsed 54%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (2major+494minor)pagefaults 0swaps
p63:/r1# rm bigfile 
p63:/r1# sync
p63:/r1# cd
p63:~# umount /r1
p63:~#

SERVER PROCESS:

p63:/dev/shm/server# blktrace -l
server: waiting for connections...
server: connection from 192.168.168.113
server: end of run for 192.168.168.113:md0
=== md0 ===
   CPU  0:              1548634 events,    72593 KiB data
   CPU  1:              1009268 events,    47310 KiB data
   Total:               2557902 events (dropped 0),   119902 KiB data
p63:/dev/shm/server# ls

CLIENT PROCESS:

# blktrace -h 192.168.168.113 /dev/md0
blktrace: connecting to 192.168.168.113
blktrace: connected!
^C=== md0 ===
   CPU  0:              1548634 events,    72593 KiB data
   CPU  1:              1009268 events,    47310 KiB data
   Total:               2557902 events (dropped 0),   119902 KiB data

From this test, the following resulted:
# du -sh *
56K     192.168.168.113-2010-02-28-13:10:48
118M    192.168.168.113-2010-02-28-13:14:00

Let this trace be called blktrace1.

p63:/dev/shm/server# du -sh blktrace1/*
56K     blktrace1/192.168.168.113-2010-02-28-13:10:48
118M    blktrace1/192.168.168.113-2010-02-28-13:14:00
p63:/dev/shm/server#

Mount with -o data=writeback,nobarrier,nodelalloc, run test blktrace2.

p63:~# umount /r1
p63:~# mount -o data=writeback,nobarrier,nodelalloc /dev/md0 /r1
p63:~# cd /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 30.6692 s, 350 MB/s
0.03user 29.55system 0:30.70elapsed 96%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (1major+495minor)pagefaults 0swaps
p63:/r1# rm bigfile 
p63:/r1# sync
p63:/r1# cd
p63:~# umount /r1
p63:~#

SERVER PROCESS:

p63:/dev/shm/server# blktrace -l
server: waiting for connections...
server: connection from 192.168.168.113
server: end of run for 192.168.168.113:md0
=== md0 ===
   CPU  0:                50056 events,     2347 KiB data
   CPU  1:              2478242 events,   116168 KiB data
   Total:               2528298 events (dropped 0),   118515 KiB data

CLIENT PROCESS:

# blktrace -h 192.168.168.113 /dev/md0
blktrace: connecting to 192.168.168.113
blktrace: connected!
^C=== md0 ===
   CPU  0:                50056 events,     2347 KiB data
   CPU  1:              2478242 events,   116168 KiB data
   Total:               2528298 events (dropped 0),   118515 KiB data
#

p63:/dev/shm/server# du -sh 192.168.168.113-2010-02-28-13\:17\:22/*
2.4M    192.168.168.113-2010-02-28-13:17:22/md0.blktrace.0
114M    192.168.168.113-2010-02-28-13:17:22/md0.blktrace.1

This is blktrace2.

One more time (blktrace3), this time with ordered data mode.

p63:~# mount -o nobarrier /dev/md0 /r1
p63:~# dmesg | tail -n 2
[ 2788.928806] EXT4-fs (md0): barriers disabled
[ 2789.340573] EXT4-fs (md0): mounted filesystem with ordered data mode
p63:~# cd /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 36.2893 s, 296 MB/s
0.04user 19.29system 0:36.32elapsed 53%CPU (0avgtext+0avgdata 7296maxresident)k
0inputs+0outputs (1major+494minor)pagefaults 0swaps
p63:/r1# rm bigfile 
p63:/r1# sync
p63:/r1# cd
p63:~# umount /r1
p63:~#

SERVER PROCESS:

p63:/dev/shm/server# blktrace -l
server: waiting for connections...
server: connection from 192.168.168.113
server: end of run for 192.168.168.113:md0
=== md0 ===
   CPU  0:              1587087 events,    74395 KiB data
   CPU  1:               970979 events,    45515 KiB data
   Total:               2558066 events (dropped 0),   119910 KiB data
p63:/dev/shm/server#

CLIENT PROCESS:

# blktrace -h 192.168.168.113 /dev/md0
blktrace: connecting to 192.168.168.113
blktrace: connected!
=== md0 ===
   CPU  0:              1587087 events,    74395 KiB data
   CPU  1:               970979 events,    45515 KiB data
   Total:               2558066 events (dropped 0),   119910 KiB data
#

TRACE OUTPUT TOTAL AND SUMMARY:


p63:~/results-20100228# du -sh *
570M    blktrace1       => -o data=writeback,nobarrier
570M    blktrace1-redo  => -o data=writeback,nobarrier
563M    blktrace2       => -o data=writeback,nobarrier,nodelalloc
570M    blktrace3       => -o nobarrier (ordered data mode)
4.0K    script
p63:~/results-20100228#

USING SCRIPT ON PAGE 24/30:

Running post-process.sh for each trace, blktrace{1,2,3}; the script itself
is from page 24/30:

# cat /root/post-process.sh 
#! /bin/bash
blkrawverify md0 # check data for errors
blkparse -d md0.bin -i md0 > md0.blkparse # merged binary, parsed
btt -i md0.bin --all-data > md0.btt # basic btt report
# now the whole enchilada!
btt -i md0.bin -o md0x --all-data --easy-parse-avgs \
--iostat=md0x.iostat \
--per-io-dump=md0x.pid \
--q2d-latencies=md0x \
--d2c-latencies=md0x \
--q2c-latencies=md0x \
--dump-blocknos=md0x_dbn \
--active-queue-depth=md0x \
--unplug-hist=md0x_uph \
--seeks=seeks \
--seeks-per-second=sps \
--per-io-trees=md0x_pit \
> md0x.btt # md0x.btt is empty

#

Before running any tests, back up the raw data:

p63:/dev/shm# tar cf /root/server.tar server
p63:/dev/shm#

For each directory, run post-process:

blktrace1: (I must have waited too long between steps here, so it produced two trace directories)

p63:/dev/shm/server/blktrace1# ls -1
192.168.168.113-2010-02-28-13:10:48
192.168.168.113-2010-02-28-13:14:00
p63:/dev/shm/server/blktrace1# cd *48
p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:10:48# /root/post-process.sh 
Verifying md0
     CPU 0
     CPU 1 
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:10:48# cd ../*00
p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:14:00# /root/post-process.sh 
Verifying md0
     CPU 0
     CPU 1 
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:14:00#

I will make another blktrace1 run and be faster this time so all data
results are of the same type; it is called blktrace1-redo:

blktrace1-redo:

p63:/dev/shm/server/blktrace1-redo/192.168.168.113-2010-02-28-13:35:45# /root/post-process.sh 
Verifying md0
     CPU 0
     CPU 1 
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace1-redo/192.168.168.113-2010-02-28-13:35:45#

blktrace2:

p63:/dev/shm/server/blktrace2/192.168.168.113-2010-02-28-13:17:22# /root/post-process.sh 
Verifying md0
     CPU 0
     CPU 1 
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace2/192.168.168.113-2010-02-28-13:17:22#

blktrace3:

p63:/dev/shm/server/blktrace3/192.168.168.113-2010-02-28-13:31:29# /root/post-process.sh 
Verifying md0
     CPU 0
     CPU 1 
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace3/192.168.168.113-2010-02-28-13:31:29#

------------

=== FINAL RESULTS


p63:~/results-20100228# du -sh */*
216K    blktrace1/192.168.168.113-2010-02-28-13:10:48
570M    blktrace1/192.168.168.113-2010-02-28-13:14:00
570M    blktrace1-redo/192.168.168.113-2010-02-28-13:35:45
563M    blktrace2/192.168.168.113-2010-02-28-13:17:22
570M    blktrace3/192.168.168.113-2010-02-28-13:31:29
4.0K    script/post-process.sh
p63:~/results-20100228#

I used 7zip to compress the results because it offers a better compression
ratio than any other utility I have tried, including the latest 'xz':
http://fixunix.com/kernel/238089-response-kernel-compression-e-mail-few-months-ago.html

$ xz -9 linux-2.6.16.17.tar

$ du -sk * | sort -n
32392 linux-2.6.16.17.tar.7z
32404 linux-2.6.16.17.tar.xz
33520 linux-2.6.16.17.tar.lzma
33760 linux-2.6.16.17.tar.rar
38064 linux-2.6.16.17.tar.rz
39472 linux-2.6.16.17.tar.szip
39520 linux-2.6.16.17.tar.bz
39936 linux-2.6.16.17.tar.bz2
40000 linux-2.6.16.17.tar.bicom
40656 linux-2.6.16.17.tar.sit
47664 linux-2.6.16.17.tar.lha
49968 linux-2.6.16.17.tar.dzip
50000 linux-2.6.16.17.tar.gz
51344 linux-2.6.16.17.tar.arj
57552 linux-2.6.16.17.tar.lzo
57984 linux-2.6.16.17.tar.F
81136 linux-2.6.16.17.tar.Z
94544 linux-2.6.16.17.tar.zoo
101216 linux-2.6.16.17.tar.arc
228608 linux-2.6.16.17.tar
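(For reference, a typical p7zip invocation for an archive like this is
roughly the following -- exact flags may differ from what was actually used:

$ tar cf results-20100228.tar results-20100228
$ 7z a -mx=9 results-20100228.tar.7z results-20100228.tar
)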

=== COMPRESSION RESULTS:

-rw-r--r-- 1 abc users 155M 2010-02-28 09:02 results-20100228.tar.7z
-rw-r--r-- 1 abc users 290M 2010-02-28 08:42 results-20100228.tar.bz2
-rw-r--r-- 1 abc users 2.3G 2010-02-28 08:42 results-20100228.tar

=== LOCATION:

http://liquidswords.org/~war/results-20100228.tar.7z
wget http://liquidswords.org/~war/results-20100228.tar.7z

=== MD5 CHECKSUM:

$ md5sum *
1db01600ce2700854b4bafcfd68f7028  results-20100228.tar.7z
35793b283edf5c0f38738276812aad52  results-20100228.tar

=== VERIFICATION: MAKE SURE IT WORKS FOR OTHERS:

$ wget http://liquidswords.org/~war/results-20100228.tar.7z
--2010-02-28 09:48:36--  http://liquidswords.org/~war/results-20100228.tar.7z
Resolving liquidswords.org... 71.6.165.232
Connecting to liquidswords.org|71.6.165.232|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 161814574 (154M) [application/x-tar]
Saving to: "results-20100228.tar.7z"

100%[======================================>] 161,814,574 2.00M/s   in 69s

$

$ md5sum *7z
1db01600ce2700854b4bafcfd68f7028  results-20100228.tar.7z

CORRECT

$ 7z x results-20100228.tar.7z

7-Zip  4.58 beta  Copyright (c) 1999-2008 Igor Pavlov  2008-05-05
p7zip Version 4.58 (locale=en_US,Utf16=on,HugeFiles=on,8 CPUs)

Processing archive: results-20100228.tar.7z

Extracting  results-20100228.tar

Everything is Ok

Size:       2382561280
Compressed: 161814574

$ md5sum *tar
35793b283edf5c0f38738276812aad52  results-20100228.tar

CORRECT

Again, the trace information details:

p63:~/results-20100228# du -sh *
570M    blktrace1       => -o data=writeback,nobarrier
570M    blktrace1-redo  => -o data=writeback,nobarrier
563M    blktrace2       => -o data=writeback,nobarrier,nodelalloc
570M    blktrace3       => -o nobarrier (ordered data mode)
4.0K    script
p63:~/results-20100228#

Let me know if you need anything else, thanks.

Justin.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27 11:36               ` Justin Piszcz
  2010-02-28  5:42                 ` tytso
@ 2010-02-28 23:50                 ` Dave Chinner
  1 sibling, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2010-02-28 23:50 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Eric Sandeen, linux-ext4, linux-kernel, Alan Piszcz

On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
> Besides large sequential I/O, ext4 seems to be MUCH faster than XFS when
> working with many small files.
>
> EXT4
>
> p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
> 0.18user 2.43system 0:02.86elapsed 91%CPU (0avgtext+0avgdata 5216maxresident)k
> 0inputs+0outputs (0major+971minor)pagefaults 0swaps
> linux-2.6.33  linux-2.6.33.tar
> p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
> 0.02user 0.98system 0:01.03elapsed 97%CPU (0avgtext+0avgdata 5216maxresident)k
> 0inputs+0outputs (0major+865minor)pagefaults 0swaps
>
> XFS
>
> p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
> 0.20user 2.62system 1:03.90elapsed 4%CPU (0avgtext+0avgdata 5200maxresident)k
> 0inputs+0outputs (0major+970minor)pagefaults 0swaps
> p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
> 0.03user 2.02system 0:29.04elapsed 7%CPU (0avgtext+0avgdata 5200maxresident)k
> 0inputs+0outputs (0major+864minor)pagefaults 0swaps

Mount XFS with "-o logbsize=262144". Metadata-intensive workloads on
XFS are log-IO bound, so a larger log buffer size makes a big
difference. On 2.6.33 kernels on a single 15krpm SCSI drive I've
been getting ~21s for the untar, and 8s for the rm -rf with that
option set. Still slower than ext4, but nowhere near as bad.
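
Something like this -- rough sketch, adjust the device/mountpoint to
your setup:

# umount /r1
# mount -o logbsize=262144 /dev/md0 /r1
# grep ' /r1 ' /proc/mounts    # should list logbsize among the options

Then rerun the untar/rm test.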

> So I guess that's the tradeoff, for massive I/O you should use XFS, else,
> use EXT4?

I wouldn't consider writing an 11GB file "massive IO", nor would I
consider 600MB/s massive, either, since you can get that out of a
sub-$10k server these days....

> I still would like to know however, why 350MiB/s seems to be the maximum
> performance I can get from two different md raids (that easily do 600MiB/s
> with XFS).

Check whether the dd process on ext4 is CPU bound....
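
A crude way to check that while the dd is running -- sketch only:

# dd if=/dev/zero of=/r1/bigfile bs=1M count=10240 &
# top -b -d 2 -n 15 | grep -E 'Cpu|dd|flush|jbd2'

If dd (or the flush/jbd2 threads) sits pinned near 100% of a core,
that's your ceiling.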

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-28 14:55                   ` Justin Piszcz
@ 2010-03-01  8:39                     ` Andreas Dilger
  2010-03-01  9:21                       ` Justin Piszcz
  2010-03-01 16:15                     ` Eric Sandeen
  1 sibling, 1 reply; 24+ messages in thread
From: Andreas Dilger @ 2010-03-01  8:39 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: tytso, Eric Sandeen, linux-ext4, linux-kernel, Alan Piszcz

On 2010-02-28, at 07:55, Justin Piszcz wrote:
> === CREATE RAID-0 WITH 11 DISKS


Have you tried testing with "nice" numbers of disks in your RAID set  
(e.g. 8 disks for RAID-0, 9 for RAID-5, 10 for RAID-6)?  The mballoc  
code is really much better tuned for power-of-two sized allocations.
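
For an 8-disk RAID-0 with a 64k chunk that would be something like the
following (sketch only, devices are placeholders), with the geometry
passed through to mkfs:

# mdadm --create /dev/md0 --level=0 --raid-devices=8 --chunk=64 /dev/sd[b-i]1
# mkfs.ext4 -E stride=16,stripe-width=128 /dev/md0

(stride = 64k chunk / 4k block = 16 blocks; stripe-width = 16 * 8 = 128.)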

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-03-01  8:39                     ` Andreas Dilger
@ 2010-03-01  9:21                       ` Justin Piszcz
  2010-03-01 14:48                         ` Michael Tokarev
  0 siblings, 1 reply; 24+ messages in thread
From: Justin Piszcz @ 2010-03-01  9:21 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: tytso, Eric Sandeen, linux-ext4, linux-kernel, Alan Piszcz



On Mon, 1 Mar 2010, Andreas Dilger wrote:

> On 2010-02-28, at 07:55, Justin Piszcz wrote:
>> === CREATE RAID-0 WITH 11 DISKS
>
>
> Have you tried testing with "nice" numbers of disks in your RAID set (e.g. 8 
> disks for RAID-0, 9 for RAID-5, 10 for RAID-6)?  The mballoc code is really 
> much better tuned for power-of-two sized allocations.

Hi,

Yes, the second system (RAID-5) has 8 disks and it shows the same
performance problems with ext4 but not with XFS (as shown in the previous
e-mail), where XFS usually got 500-600MiB/s for writes.

http://groups.google.com/group/linux.kernel/browse_thread/thread/e7b189bcaa2c1cb4/ad6c2a54b678cf5f?show_docid=ad6c2a54b678cf5f&pli=1

For the RAID-5 (from earlier testing):  <- This one has 8 disks.
-o data=writeback,nobarrier: 
10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s 
-o data=writeback,nobarrier,nodelalloc: 
10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s 
An increase of 132 MB/s.

Justin.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-03-01  9:21                       ` Justin Piszcz
@ 2010-03-01 14:48                         ` Michael Tokarev
  2010-03-01 15:07                           ` Justin Piszcz
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Tokarev @ 2010-03-01 14:48 UTC (permalink / raw)
  To: Justin Piszcz
  Cc: Andreas Dilger, tytso, Eric Sandeen, linux-ext4, linux-kernel,
	Alan Piszcz

Justin Piszcz wrote:
> 
> On Mon, 1 Mar 2010, Andreas Dilger wrote:
> 
>> On 2010-02-28, at 07:55, Justin Piszcz wrote:
>>> === CREATE RAID-0 WITH 11 DISKS
>>
>> Have you tried testing with "nice" numbers of disks in your RAID set
>> (e.g. 8 disks for RAID-0, 9 for RAID-5, 10 for RAID-6)?  The mballoc
>> code is really much better tuned for power-of-two sized allocations.
> 
> Hi,
> 
> Yes, the second system (RAID-5) has 8 disks and it shows the same
> performance problems with ext4 and not XFS (as shown from previous
> e-mail), where XFS usually got 500-600MiB/s for writes.
> 
> http://groups.google.com/group/linux.kernel/browse_thread/thread/e7b189bcaa2c1cb4/ad6c2a54b678cf5f?show_docid=ad6c2a54b678cf5f&pli=1
> 
> 
> For the RAID-5 (from earlier testing):  <- This one has 8 disks.

Note that for RAID-5, the "nice" number of disks is 9 as Andreas
said, not 8 as in your example (9 drives leaves 8 data disks per
stripe, i.e. a power-of-two stripe width).

/mjt

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-03-01 14:48                         ` Michael Tokarev
@ 2010-03-01 15:07                           ` Justin Piszcz
  0 siblings, 0 replies; 24+ messages in thread
From: Justin Piszcz @ 2010-03-01 15:07 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Andreas Dilger, tytso, Eric Sandeen, linux-ext4, linux-kernel,
	Alan Piszcz



On Mon, 1 Mar 2010, Michael Tokarev wrote:

> Justin Piszcz wrote:
>>
>> On Mon, 1 Mar 2010, Andreas Dilger wrote:
>>
>>> On 2010-02-28, at 07:55, Justin Piszcz wrote:
>>>> === CREATE RAID-0 WITH 11 DISKS
>>>
>>> Have you tried testing with "nice" numbers of disks in your RAID set
>>> (e.g. 8 disks for RAID-0, 9 for RAID-5, 10 for RAID-6)?  The mballoc
>>> code is really much better tuned for power-of-two sized allocations.
>>
>> Hi,
>>
>> Yes, the second system (RAID-5) has 8 disks and it shows the same
>> performance problems with ext4 and not XFS (as shown from previous
>> e-mail), where XFS usually got 500-600MiB/s for writes.
>>
>> http://groups.google.com/group/linux.kernel/browse_thread/thread/e7b189bcaa2c1cb4/ad6c2a54b678cf5f?show_docid=ad6c2a54b678cf5f&pli=1
>>
>>
>> For the RAID-5 (from earlier testing):  <- This one has 8 disks.
>
> Note that for RAID-5, the "nice" number of disks is 9 as Andreas
> said, not 8 as in your example.
>
> /mjt
>

Hi, thanks for this.

RAID-0 with 12 disks:

p63:~# mdadm --create -e 0.90 /dev/md0 /dev/sd[b-m]1 --level=0 -n 12 -c 64
mdadm: /dev/sdb1 appears to contain an ext2fs file system
     size=1077256000K  mtime=Sun Feb 28 08:35:47 2010
mdadm: /dev/sdb1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdc1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdd1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sde1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdf1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdg1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdh1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdi1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdj1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdk1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdl1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdm1 appears to be part of a raid array:
     level=raid6 devices=11 ctime=Sat Feb 27 06:57:29 2010
Continue creating array? y
mdadm: array /dev/md0 started.
p63:~# mkfs.ext4 /dev/md0
mke2fs 1.41.10 (10-Feb-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
366288896 inodes, 1465151808 blocks
73257590 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
44713 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
         102400000, 214990848, 512000000, 550731776, 644972544

Writing inode tables: 28936/44713..etc

p63:~# mount -o nobarrier /dev/md0 /r1
p63:~# cd /r1
p63:/r1# dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 34.9723 s, 307 MB/s
p63:/r1#

Same issue for EXT4; with XFS, it is much faster:

p63:~# mkfs.xfs /dev/md0 -f
meta-data=/dev/md0               isize=256    agcount=32, agsize=45786000 blks
          =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1465151808, imaxpct=5
          =                       sunit=16     swidth=192 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
mount /dev/md0 /r1
p63:~# mount /dev/md0 /r1
p63:~# cd /r1
p63:/r1# dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 17.6473 s, 608 MB/s
p63:/r1#

Justin.





^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-28 14:55                   ` Justin Piszcz
  2010-03-01  8:39                     ` Andreas Dilger
@ 2010-03-01 16:15                     ` Eric Sandeen
  1 sibling, 0 replies; 24+ messages in thread
From: Eric Sandeen @ 2010-03-01 16:15 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: tytso, linux-ext4, linux-kernel, Alan Piszcz

Justin Piszcz wrote:
> 
> 
> On Sun, 28 Feb 2010, tytso@mit.edu wrote:
> 
>> On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
>>>
>>> I still would like to know however, why 350MiB/s seems to be the maximum
>>> performance I can get from two different md raids (that easily do
>>> 600MiB/s
>>> with XFS).
> 
>> Can you run "filefrag -v <filename>" on the large file you created
>> using dd?  Part of the problem may be the block allocator simply not
>> being well optimized super large writes.  To be honest, that's not
>> something we've tried (at all) to optimize, mainly because for most
>> users of ext4 they're more interested in much more reasonable sized
>> files, and we only have so many hours in a day to hack on ext4.  :-)
>> XFS in contrast has in the past had plenty of paying customers
>> interested in writing really large scientific data sets, so this is
>> something Irix *has* spent time optimizing.
> Yes, this is shown at the bottom of the e-mail both with -o data=ordered
> and data=writeback.

...

> === SHOW FILEFRAG OUTPUT (NOBARRIER,ORDERED)
> 
> p63:/r1# filefrag -v /r1/bigfile Filesystem type is: ef53
> File size of /r1/bigfile is 10737418240 (2621440 blocks, blocksize 4096)
>  ext logical  physical  expected length flags
>    0       0     34816            32768
>    1   32768     67584            30720
>    2   63488    100352     98303  32768
>    3   96256    133120            30720
>    4  126976    165888    163839  32768
>    5  159744    198656            30720
...

That looks pretty good.

I think Dave's suggesting of seeing what cpu usage looks like is a good one.

Running blktrace on xfs vs. ext4 could possibly also shed some light;
see the sketch below.
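
A local capture is enough for a quick comparison, something along these
lines (sketch):

# blktrace -d /dev/md0 -o xfsrun &
# dd if=/dev/zero of=/r1/bigfile bs=1M count=10240 conv=fsync
# kill %1
# blkparse -i xfsrun | tail -30    # per-device totals are at the end

and the same again on the ext4 mount.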

-Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-02-27  0:31 EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes? Justin Piszcz
  2010-02-27  0:46 ` Dmitry Monakhov
  2010-02-27  0:51 ` Eric Sandeen
@ 2010-03-02  0:08 ` Eric Sandeen
  2010-03-02  0:37   ` Eric Sandeen
  2 siblings, 1 reply; 24+ messages in thread
From: Eric Sandeen @ 2010-03-02  0:08 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-ext4, linux-kernel, Alan Piszcz

Justin Piszcz wrote:
> Hello,
> 
> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
> I see about half the performance as XFS for sequential writes.
> 
> I have checked the doc and tried several options, a few of which are shown
> below (I have also tried the commit/journal_async/etc options but none
> of them get the write speeds anywhere near XFS)?
> 
> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
> 
> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
> volume.
> 
> How do I 'speed' up ext4?  Is it possible?
> 

FWIW I'm seeing similar things on fast storage (Fusion IO),
though this is under 2.6.31.  500MB/s+ for xfs, 300 for ext4.

Overwriting an existing file is no faster.  I don't think this
driver is blktraceable, but I'll try a newer driver that should be, I think.

(xfs's overwrite went from 534 to 597 mb/s; ext4 sat at 320-ish)

direct IO was good for both xfs & ext4 at around 530mb/s
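
For clarity, by direct IO I mean something along these lines (testfile
is just a placeholder):

# dd if=/dev/zero of=testfile bs=1M count=10240 oflag=direct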

I'll see if I can get this running on a more recent kernel to
do further investigation.

-Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?
  2010-03-02  0:08 ` Eric Sandeen
@ 2010-03-02  0:37   ` Eric Sandeen
  0 siblings, 0 replies; 24+ messages in thread
From: Eric Sandeen @ 2010-03-02  0:37 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-ext4, linux-kernel, Alan Piszcz

Eric Sandeen wrote:
> Justin Piszcz wrote:
>> Hello,
>>
>> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
>> I see about half the performance as XFS for sequential writes.
>>
>> I have checked the doc and tried several options, a few of which are shown
>> below (I have also tried the commit/journal_async/etc options but none
>> of them get the write speeds anywhere near XFS)?
>>
>> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
>> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>>
>> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
>> volume.
>>
>> How do I 'speed' up ext4?  Is it possible?
>>
> 
> FWIW I'm seeing similar things on fast storage (Fusion IO),
> though this is under 2.6.31.  500MB/s+ for xfs, 300 for ext4.
> 
> Overwriting an existing file is no faster.   I don't think this
> driver is blktraceable but I'll try a newer driver that should be I think.

FWIW, blktrace (I'm still on 2.6.31) is enlightening:

Total (xfs):
 Reads Queued:           4,       16KiB	 Writes Queued:     122,567,   10,485MiB
 Read Dispatches:        4,       16KiB	 Write Dispatches:   83,219,   10,485MiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        4,       16KiB	 Writes Completed:   83,219,   10,485MiB
 Read Merges:            0,        0KiB	 Write Merges:       39,348,  314,804KiB
 IO unplugs:           344        	 Timer unplugs:         338

Total (ext4):
 Reads Queued:          14,       56KiB	 Writes Queued:       2,621K,   10,486MiB
 Read Dispatches:       14,       56KiB	 Write Dispatches:  107,944,   10,486MiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:       14,       56KiB	 Writes Completed:  107,944,   10,486MiB
 Read Merges:            0,        0KiB	 Write Merges:        2,513K,   10,054MiB
 IO unplugs:         2,461        	 Timer unplugs:       2,020


See "Writes Queued"  See also submit_bio() calls in xfs.

ext4 doing things a block at a time is certainly giving the elevator a workout...
I'd tend to chalk it up to that at first glance.

-Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2010-03-02  0:37 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-02-27  0:31 EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes? Justin Piszcz
2010-02-27  0:46 ` Dmitry Monakhov
2010-02-27  1:05   ` Justin Piszcz
2010-02-27  1:05     ` Justin Piszcz
2010-02-28  0:56     ` Asdo
2010-02-28  9:59       ` Justin Piszcz
2010-02-27  0:51 ` Eric Sandeen
2010-02-27  1:08   ` Justin Piszcz
2010-02-27  1:12     ` Eric Sandeen
2010-02-27  1:28       ` Eric Sandeen
2010-02-27 10:14         ` Justin Piszcz
2010-02-27 10:51           ` Justin Piszcz
2010-02-27 11:09             ` Justin Piszcz
2010-02-27 11:36               ` Justin Piszcz
2010-02-28  5:42                 ` tytso
2010-02-28 14:55                   ` Justin Piszcz
2010-03-01  8:39                     ` Andreas Dilger
2010-03-01  9:21                       ` Justin Piszcz
2010-03-01 14:48                         ` Michael Tokarev
2010-03-01 15:07                           ` Justin Piszcz
2010-03-01 16:15                     ` Eric Sandeen
2010-02-28 23:50                 ` Dave Chinner
2010-03-02  0:08 ` Eric Sandeen
2010-03-02  0:37   ` Eric Sandeen
