* ext4, barrier, md/RAID1 and write cache
@ 2012-05-07 10:47 Daniel Pocock
  2012-05-07 16:25 ` Martin Steigerwald
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Pocock @ 2012-05-07 10:47 UTC (permalink / raw)
  To: linux-ext4



I've been having some NFS performance issues, and have been
experimenting with the server filesystem (ext4) to see if that is a factor.

The setup is like this:

(Debian 6, kernel 2.6.39)
2x SATA drive (NCQ, 32MB cache, no hardware RAID)
md RAID1
LVM
ext4

a) If I use data=ordered,barrier=1 and `hdparm -W 1' on the drive, I
observe write performance over NFS of 1MB/sec (unpacking a big source
tarball)

b) If I use data=writeback,barrier=0 and `hdparm -W 1' on the drive, I
observe write performance over NFS of 10MB/sec

c) If I just use the async option on NFS, I observe up to 30MB/sec

I believe (b) and (c) are not considered safe against filesystem
corruption, so I can't use them in practice.
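Spelled out, the three configurations correspond roughly to the following (device names, paths and client names are illustrative, not my actual setup):

```
# /etc/fstab, case (a): barriers on, plus `hdparm -W 1' on the drives
/dev/vg0/export  /export  ext4  data=ordered,barrier=1    0  2

# /etc/fstab, case (b): barriers off (unsafe with a volatile write cache)
/dev/vg0/export  /export  ext4  data=writeback,barrier=0  0  2

# /etc/exports, case (c): async lets the server ACK writes before they hit disk
/export  client.example.org(rw,async,no_subtree_check)
```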

Can anyone suggest where I should direct my efforts to lift performance?
E.g.:

- does SCSI handle barriers better? Would buying SCSI drives simply
solve the problem under config (a)?

- should I do away with md RAID and consider btrfs which does RAID1
within the filesystem itself?

- or must I just use option (b) but make it safer with battery-backed
write cache?

- or is there any md or lvm issue that can be tuned or fixed by
upgrading the kernel?


* Re: ext4, barrier, md/RAID1 and write cache
  2012-05-07 10:47 ext4, barrier, md/RAID1 and write cache Daniel Pocock
@ 2012-05-07 16:25 ` Martin Steigerwald
  2012-05-07 16:44   ` Daniel Pocock
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Steigerwald @ 2012-05-07 16:25 UTC (permalink / raw)
  To: Daniel Pocock; +Cc: linux-ext4

On Monday, 7 May 2012, Daniel Pocock wrote:
> I've been having some NFS performance issues, and have been
> experimenting with the server filesystem (ext4) to see if that is a
> factor.

Which NFS version is this?
 
> The setup is like this:
> 
> (Debian 6, kernel 2.6.39)
> 2x SATA drive (NCQ, 32MB cache, no hardware RAID)
> md RAID1
> LVM
> ext4
> 
> a) If I use data=ordered,barrier=1 and `hdparm -W 1' on the drive, I
> observe write performance over NFS of 1MB/sec (unpacking a big source
> tarball)

Is this a realistic workload scenario for production use?

> b) If I use data=writeback,barrier=0 and `hdparm -W 1' on the drive, I
> observe write performance over NFS of 10MB/sec
> 
> c) If I just use the async option on NFS, I observe up to 30MB/sec
> 
> I believe (b) and (c) are not considered safe against filesystem
> corruption, so I can't use them in practice.

Partly.

b) can harm filesystem consistency unless you disable the write cache on the
disks.

c) won't harm local filesystem consistency, but should the NFS server break
down, any data that the NFS clients sent to the server for writing which has
not yet been written is gone.

> - or must I just use option (b) but make it safer with battery-backed
> write cache?

If you want performance and safety, that is the best of the options you
mentioned, provided the workload really is I/O bound on the local filesystem.

Of course you can try the usual tricks: mount with noatime, remove the rsize
and wsize options on the NFS clients if they have a new enough kernel (they
autotune to much higher values than the often-recommended 8192 or 32768
bytes; check /proc/mounts), put the ext4 journal onto a separate disk to
reduce head seeks, check whether enough NFS server threads are running, try a
different filesystem, and so on.
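A few of those checks can be done from the command line; the `line=' value
below is a made-up example, only meant to show what the autotuned values look
like in /proc/mounts:

```shell
# On an NFS client: see the rsize/wsize actually negotiated for each mount
grep ' nfs' /proc/mounts || true

# On the server: how many nfsd threads are configured
cat /proc/fs/nfsd/threads 2>/dev/null || true

# Extracting rsize/wsize from a (made-up) /proc/mounts line:
line='srv:/export /mnt nfs rw,rsize=1048576,wsize=1048576,vers=3 0 0'
echo "$line" | sed -n 's/.*rsize=\([0-9]*\),wsize=\([0-9]*\).*/rsize=\1 wsize=\2/p'
```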

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: ext4, barrier, md/RAID1 and write cache
  2012-05-07 16:25 ` Martin Steigerwald
@ 2012-05-07 16:44   ` Daniel Pocock
  2012-05-07 16:54     ` Andreas Dilger
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Pocock @ 2012-05-07 16:44 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-ext4

On 07/05/12 18:25, Martin Steigerwald wrote:
> On Monday, 7 May 2012, Daniel Pocock wrote:
>   
>> I've been having some NFS performance issues, and have been
>> experimenting with the server filesystem (ext4) to see if that is a
>> factor.
>>     
> Which NFS version is this?
>   
Normally I am using NFSv3.  I've also tried NFSv4 to see if that fixed
the problem, but it made no difference.

>  
>   
>> The setup is like this:
>>
>> (Debian 6, kernel 2.6.39)
>>     
I've just updated to kernel 3.2 from squeeze-backports; it doesn't resolve
the issue either.

>> 2x SATA drive (NCQ, 32MB cache, no hardware RAID)
>> md RAID1
>> LVM
>> ext4
>>
>> a) If I use data=ordered,barrier=1 and `hdparm -W 1' on the drive, I
>> observe write performance over NFS of 1MB/sec (unpacking a big source
>> tarball)
>>     
> Is this a realistic workload scenario for production use?
>
>   

Yes, it is a small server with a few users.  I keep some open source
projects on it: git repositories, compiling, etc.  So the usual workload
involves unpacking source code, git checkouts, and compiles that generate
many object files.  All these operations are excruciatingly slow unless
I run NFS in `async' mode, which is not considered safe.

>> b) If I use data=writeback,barrier=0 and `hdparm -W 1' on the drive, I
>> observe write performance over NFS of 10MB/sec
>>
>> c) If I just use the async option on NFS, I observe up to 30MB/sec
>>
>> I believe (b) and (c) are not considered safe against filesystem
>> corruption, so I can't use them in practice.
>>     
> Partly.
>
> b) can harm filesystem consistency unless you disable write cache on the 
> disks
>
>   

(b) only achieves any performance improvement if the write cache is
enabled - so it is not a production solution

> c) won't harm local filesystem consistency, but should the NFS server break
> down, any data that the NFS clients sent to the server for writing which has
> not yet been written is gone.
>
>   
Most of the access is from NFS, so (c) is not a good solution either.

>> - or must I just use option (b) but make it safer with battery-backed
>> write cache?
>>     
> If you want performance and safety that is the best option from the ones 
> you mentioned, if the workload is really I/O bound on the local filesystem. 
>
> Of course you can try the usual tricks like noatime, remove rsize and 
> wsize options on the NFS client if they have a new enough kernel (they 
> autotune to much higher than the often recommended 8192 or 32768 bytes, 
> look at /proc/mounts), put ext4 journal onto an extra disk to reduce head 
> seeks, check whether enough NFS server threads are running, try a different 
> filesystem and so on.
>
>   
One further discovery I made: I decided to eliminate md and LVM.  I had
enough space to create a 256MB partition on one of the disks, and formatted
it directly with ext4.

Writing to that partition from the NFSv3 client:
- less than 500kBytes/sec (for unpacking a tarball of source code)
- around 50MB/sec (dd if=/dev/zero conv=fsync bs=65536)

and I then connected an old 5400rpm USB disk to the machine, ran the
same test from the NFS client:
- 5MBytes/sec (for unpacking a tarball of source code), 10x faster than
the 7200 RPM SATA disk

This last test (comparing my AHCI SATA disk to the USB disk, with no md
or LVM) makes me think it is not an NFS problem; I suspect some issue
with barriers on this AHCI controller or SATA disk.




* Re: ext4, barrier, md/RAID1 and write cache
  2012-05-07 16:44   ` Daniel Pocock
@ 2012-05-07 16:54     ` Andreas Dilger
  2012-05-07 17:28       ` Daniel Pocock
  0 siblings, 1 reply; 14+ messages in thread
From: Andreas Dilger @ 2012-05-07 16:54 UTC (permalink / raw)
  To: Daniel Pocock; +Cc: Martin Steigerwald, linux-ext4

On 2012-05-07, at 10:44 AM, Daniel Pocock wrote:
> On 07/05/12 18:25, Martin Steigerwald wrote:
>> On Monday, 7 May 2012, Daniel Pocock wrote:
>>> 2x SATA drive (NCQ, 32MB cache, no hardware RAID)
>>> md RAID1
>>> LVM
>>> ext4
>>> 
>>> a) If I use data=ordered,barrier=1 and `hdparm -W 1' on the drive,
>>>    I observe write performance over NFS of 1MB/sec (unpacking a
>>>    big source tarball)
>>> 
>>> b) If I use data=writeback,barrier=0 and `hdparm -W 1' on the drive,
>>>    I observe write performance over NFS of 10MB/sec
>>> 
>>> c) If I just use the async option on NFS, I observe up to 30MB/sec

The only proper way to isolate the cause of performance problems is to test each layer separately.

What is the performance running this workload against the same ext4
filesystem locally (i.e. without NFS)?  How big are the files?  If
you run some kind of low-level benchmark against the underlying MD
RAID array, with synchronous IOPS of the average file size, what is
the performance?

Do you have something like the MD RAID resync bitmaps enabled?  That
can kill performance, though it improves the rebuild time after a
crash.  Putting these bitmaps onto a small SSD, or e.g. a separate
boot disk (if you have one), can improve performance significantly.

>> c) won't harm local filesystem consistency, but should the NFS server break down, any data that the NFS clients sent to the server for
>> writing which has not yet been written is gone.
> 
> Most of the access is from NFS, so (c) is not a good solution either.

Well, this behaviour is not significantly worse than applications
writing to a local filesystem, and the node crashing and losing the
dirty data in memory that has not been written to disk.

>>> - or must I just use option (b) but make it safer with battery-backed
>>> write cache?
>> 
>> If you want performance and safety that is the best option from the
>> ones you mentioned, if the workload is really I/O bound on the local filesystem. 
>> 
>> Of course you can try the usual tricks like noatime, remove rsize and 
>> wsize options on the NFS client if they have a new enough kernel (they 
>> autotune to much higher than the often recommended 8192 or 32768 bytes, 
>> look at /proc/mounts), put ext4 journal onto an extra disk to reduce head seeks, check whether enough NFS server threads are running, try a
>> different filesystem and so on.
> 
> One further discovery I made: I decided to eliminate md and LVM.  I had
> enough space to create a 256MB partition on one of the disks, and format
> it directly with ext4
> 
> Writing to that partition from the NFS3 client:
> - less than 500kBytes/sec (for unpacking a tarball of source code)
> - around 50MB/sec (dd if=/dev/zero conv=fsync bs=65536)
> 
> and I then connected an old 5400rpm USB disk to the machine, ran the
> same test from the NFS client:
> - 5MBytes/sec (for unpacking a tarball of source code), 10x faster than
> the 7200 RPM SATA disk

Possibly the older disk is lying about doing cache flushes.  The
wonderful disk manufacturers do that with commodity drives to make
their benchmark numbers look better.  If you run some random IOPS
test against this disk, and it has performance much over 100 IOPS
then it is definitely not doing real cache flushes.

> This last test (comparing my AHCI SATA disk to the USB disk, with no md
> or LVM) makes me think it is not an NFS problem, I feel it is some issue
> with the barriers when used with this AHCI or SATA disk.


Cheers, Andreas







* Re: ext4, barrier, md/RAID1 and write cache
  2012-05-07 16:54     ` Andreas Dilger
@ 2012-05-07 17:28       ` Daniel Pocock
  2012-05-07 18:59         ` Martin Steigerwald
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Pocock @ 2012-05-07 17:28 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Martin Steigerwald, linux-ext4

On 07/05/12 18:54, Andreas Dilger wrote:
> On 2012-05-07, at 10:44 AM, Daniel Pocock wrote:
>   
>> On 07/05/12 18:25, Martin Steigerwald wrote:
>>     
>>> On Monday, 7 May 2012, Daniel Pocock wrote:
>>>       
>>>> 2x SATA drive (NCQ, 32MB cache, no hardware RAID)
>>>> md RAID1
>>>> LVM
>>>> ext4
>>>>
>>>> a) If I use data=ordered,barrier=1 and `hdparm -W 1' on the drive,
>>>>    I observe write performance over NFS of 1MB/sec (unpacking a
>>>>    big source tarball)
>>>>
>>>> b) If I use data=writeback,barrier=0 and `hdparm -W 1' on the drive,
>>>>    I observe write performance over NFS of 10MB/sec
>>>>
>>>> c) If I just use the async option on NFS, I observe up to 30MB/sec
>>>>         
> The only proper way to isolate the cause of performance problems is to test each layer separately.
>
> What is the performance running this workload against the same ext4
> filesystem locally (i.e. without NFS)?  How big are the files?  If
> you run some kind of low-level benchmark against the underlying MD
> RAID array, with synchronous IOPS of the average file size, what is
> the performance?
>
>   
- the test file is 5MB compressed, over 100MB uncompressed, many C++
files of varying sizes

- testing it locally is definitely faster - but local disk writes can be
cached more aggressively than writes from an NFS client, so it is not
strictly comparable

> Do you have something like the MD RAID resync bitmaps enabled?  That
> can kill performance, though it improves the rebuild time after a
> crash.  Putting these bitmaps onto a small SSD, or e.g. a separate
> boot disk (if you have one) can improve performance significantly.
>
>   
I've checked /proc/mdstat, it doesn't report any bitmap at all


>>> c) won't harm local filesystem consistency, but should the NFS server break down, any data that the NFS clients sent to the server for
>>> writing which has not yet been written is gone.
>>>       
>> Most of the access is from NFS, so (c) is not a good solution either.
>>     
> Well, this behaviour is not significantly worse than applications
> writing to a local filesystem, and the node crashing and losing the
> dirty data in memory that has not been written to disk.
>
>   
A lot of the documents I've seen about NFS performance suggest it is
slightly worse, though, because the applications on the client have
already received positive responses from fsync().

>>>> - or must I just use option (b) but make it safer with battery-backed
>>>> write cache?
>>>>         
>>> If you want performance and safety that is the best option from the
>>> ones you mentioned, if the workload is really I/O bound on the local filesystem. 
>>>
>>> Of course you can try the usual tricks like noatime, remove rsize and 
>>> wsize options on the NFS client if they have a new enough kernel (they 
>>> autotune to much higher than the often recommended 8192 or 32768 bytes, 
>>> look at /proc/mounts), put ext4 journal onto an extra disk to reduce head seeks, check whether enough NFS server threads are running, try a
>>> different filesystem and so on.
>>>       
>> One further discovery I made: I decided to eliminate md and LVM.  I had
>> enough space to create a 256MB partition on one of the disks, and format
>> it directly with ext4
>>
>> Writing to that partition from the NFS3 client:
>> - less than 500kBytes/sec (for unpacking a tarball of source code)
>> - around 50MB/sec (dd if=/dev/zero conv=fsync bs=65536)
>>
>> and I then connected an old 5400rpm USB disk to the machine, ran the
>> same test from the NFS client:
>> - 5MBytes/sec (for unpacking a tarball of source code), 10x faster than
>> the 7200 RPM SATA disk
>>     
> Possibly the older disk is lying about doing cache flushes.  The
> wonderful disk manufacturers do that with commodity drives to make
> their benchmark numbers look better.  If you run some random IOPS
> test against this disk, and it has performance much over 100 IOPS
> then it is definitely not doing real cache flushes.
>
>   

I would agree that is possible - I actually tried using hdparm and
sdparm to check cache status, but they don't work with the USB drive

I've tried the following directly onto the raw device:

dd if=/dev/zero of=/dev/sdc1 bs=4096 count=65536 conv=fsync
29.2MB/s

and iostat reported avg 250 write/sec, avgrq-sz = 237, wkB/s = 30MB/sec

I tried a smaller write as well (just count=1024, total 4MB of data) and
it also reported a slower speed, which suggests that it really is
writing the data out to disk and not just caching.



* Re: ext4, barrier, md/RAID1 and write cache
  2012-05-07 17:28       ` Daniel Pocock
@ 2012-05-07 18:59         ` Martin Steigerwald
  2012-05-07 20:56           ` Daniel Pocock
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Steigerwald @ 2012-05-07 18:59 UTC (permalink / raw)
  To: Daniel Pocock; +Cc: Andreas Dilger, linux-ext4

On Monday, 7 May 2012, Daniel Pocock wrote:
> > Possibly the older disk is lying about doing cache flushes.  The
> > wonderful disk manufacturers do that with commodity drives to make
> > their benchmark numbers look better.  If you run some random IOPS
> > test against this disk, and it has performance much over 100 IOPS
> > then it is definitely not doing real cache flushes.
[…]
> I would agree that is possible - I actually tried using hdparm and
> sdparm to check cache status, but they don't work with the USB drive
> 
> I've tried the following directly onto the raw device:
> 
> dd if=/dev/zero of=/dev/sdc1 bs=4096 count=65536 conv=fsync
> 29.2MB/s

That's not a random-I/O IOPS benchmark,

> and iostat reported avg 250 write/sec, avgrq-sz = 237, wkB/s = 30MB/sec

but a sequential workload that gives the I/O scheduler the opportunity to
combine write requests.

Also, it's using the pagecache, as conv=fsync only adds an fsync() at the
end of the dd run.
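(As an aside, dd can be told to flush on every write instead, via oflag=dsync,
or to skip the pagecache entirely via oflag=direct. A small illustrative run
against a scratch file:)

```shell
# conv=fsync: buffered writes, a single fsync() at the very end
dd if=/dev/zero of=/tmp/dd-fsync.bin bs=4096 count=256 conv=fsync

# oflag=dsync: each 4k write is synchronous; throughput drops towards
# what the disk can really commit per operation
dd if=/dev/zero of=/tmp/dd-dsync.bin bs=4096 count=256 oflag=dsync
```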
 
> I tried a smaller write as well (just count=1024, total 4MB of data)
> and it also reported a slower speed, which suggests that it really is
> writing the data out to disk and not just caching.

I think an IOPS benchmark would be better. I.e. something like:

/usr/share/doc/fio/examples/ssd-test

(from the flexible I/O tester (fio) Debian package; also included in the
upstream tarball, of course)

adapted to your needs.

Maybe with a different iodepth or numjobs (to simulate several threads
generating higher iodepths). With iodepth=1 I have seen 54 IOPS on a
Hitachi 5400 RPM hard disk connected via eSATA.

Important: use direct=1 to bypass the pagecache.
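A minimal job file in that spirit might look like this (the device path and
size here are illustrative, not taken from the packaged ssd-test example):

```ini
; adapted from /usr/share/doc/fio/examples/ssd-test
[global]
ioengine=libaio
direct=1             ; bypass the pagecache
iodepth=1            ; one outstanding I/O: no NCQ-style reordering
runtime=60
time_based
filename=/dev/sdc1   ; illustrative: point at a scratch device or file
size=1g

[rand-write]
rw=randwrite
bs=4k
```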

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: ext4, barrier, md/RAID1 and write cache
  2012-05-07 18:59         ` Martin Steigerwald
@ 2012-05-07 20:56           ` Daniel Pocock
  2012-05-07 22:24             ` Martin Steigerwald
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Pocock @ 2012-05-07 20:56 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Andreas Dilger, linux-ext4

On 07/05/12 20:59, Martin Steigerwald wrote:
> On Monday, 7 May 2012, Daniel Pocock wrote:
>   
>>> Possibly the older disk is lying about doing cache flushes.  The
>>> wonderful disk manufacturers do that with commodity drives to make
>>> their benchmark numbers look better.  If you run some random IOPS
>>> test against this disk, and it has performance much over 100 IOPS
>>> then it is definitely not doing real cache flushes.
>>>       
> […]
>   
> I think an IOPS benchmark would be better. I.e. something like:
>
> /usr/share/doc/fio/examples/ssd-test
>
> (from flexible I/O tester debian package, also included in upstream tarball 
> of course)
>
> adapted to your needs.
>
> Maybe with different iodepth or numjobs (to simulate several threads 
> generating higher iodepths). With iodepth=1 I have seen 54 IOPS on a 
> Hitachi 5400 rpm harddisk connected via eSATA.
>
> Important is direct=1 to bypass the pagecache.
>
>   
Thanks for suggesting this tool; I've run it against the USB disk and an
LV on my AHCI/SATA/md array.

Incidentally, I upgraded the Seagate firmware (model 7200.12, from CC34
to CC49) and one of the disks went offline shortly after I brought the
system back up.  To avoid the risk that a bad drive might interfere with
the SATA performance, I completely removed it before running any tests.
Tomorrow I'm going out to buy some enterprise-grade drives; I'm thinking
about Seagate Constellation SATA or even SAS.

Anyway, onto the test results:

USB disk (Seagate  9SD2A3-500 320GB):

rand-write: (groupid=3, jobs=1): err= 0: pid=22519
  write: io=46680KB, bw=796512B/s, iops=194, runt= 60012msec
    slat (usec): min=13, max=25264, avg=106.02, stdev=525.18
    clat (usec): min=993, max=103568, avg=20444.19, stdev=11622.11
    bw (KB/s) : min=  521, max= 1224, per=100.06%, avg=777.48, stdev=97.07
  cpu          : usr=0.73%, sys=2.33%, ctx=12024, majf=0, minf=20
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
     issued r/w: total=0/11670, short=0/0
     lat (usec): 1000=0.01%
     lat (msec): 2=0.01%, 4=0.24%, 10=2.75%, 20=64.64%, 50=29.97%
     lat (msec): 100=2.31%, 250=0.08%



and from the SATA disk on the AHCI controller
- Barracuda 7200.12  ST31000528AS connected to
- AMD RS785E/SB820M chipset, (lspci reports SB700/SB800 AHCI mode)

rand-write: (groupid=3, jobs=1): err= 0: pid=23038
  write: io=46512KB, bw=793566B/s, iops=193, runt= 60018msec
    slat (usec): min=13, max=35317, avg=97.09, stdev=541.14
    clat (msec): min=2, max=214, avg=20.53, stdev=18.56
    bw (KB/s) : min=    0, max=  882, per=98.54%, avg=762.72, stdev=114.51
  cpu          : usr=0.85%, sys=2.27%, ctx=11972, majf=0, minf=21
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
     issued r/w: total=0/11628, short=0/0

     lat (msec): 4=1.81%, 10=32.65%, 20=31.30%, 50=26.82%, 100=6.71%
     lat (msec): 250=0.71%




The IOPS scores look similar, but I checked carefully and I'm fairly
certain the disks were mounted correctly when the tests ran.

Should I run this tool over NFS, will the results be meaningful?

Given the need to replace a drive anyway, I'm really thinking about one
of the following approaches:
- same controller, upgrade to enterprise SATA drives
- buy a dedicated SAS/SATA controller, upgrade to enterprise SATA drives
- buy a dedicated SAS/SATA controller, upgrade to SAS drives

My HP N36L is quite small, with one PCIe x16 slot; the internal drive cage
has an SFF-8087 (mini-SAS) plug, so I'm thinking I can grab something
small like the Adaptec 1405.  Will any of these solutions offer a
definite win for my NFS issues, though?

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: ext4, barrier, md/RAID1 and write cache
  2012-05-07 20:56           ` Daniel Pocock
@ 2012-05-07 22:24             ` Martin Steigerwald
  2012-05-07 23:23               ` Daniel Pocock
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Steigerwald @ 2012-05-07 22:24 UTC (permalink / raw)
  To: Daniel Pocock; +Cc: Andreas Dilger, linux-ext4

On Monday, 7 May 2012, Daniel Pocock wrote:
> On 07/05/12 20:59, Martin Steigerwald wrote:
> > On Monday, 7 May 2012, Daniel Pocock wrote:
> >>> Possibly the older disk is lying about doing cache flushes.  The
> >>> wonderful disk manufacturers do that with commodity drives to make
> >>> their benchmark numbers look better.  If you run some random IOPS
> >>> test against this disk, and it has performance much over 100 IOPS
> >>> then it is definitely not doing real cache flushes.
> > 
> > […]
> > 
> > I think an IOPS benchmark would be better. I.e. something like:
> > 
> > /usr/share/doc/fio/examples/ssd-test
> > 
> > (from flexible I/O tester debian package, also included in upstream
> > tarball of course)
> > 
> > adapted to your needs.
> > 
> > Maybe with different iodepth or numjobs (to simulate several threads
> > generating higher iodepths). With iodepth=1 I have seen 54 IOPS on a
> > Hitachi 5400 rpm harddisk connected via eSATA.
> > 
> > Important is direct=1 to bypass the pagecache.
> 
> Thanks for suggesting this tool, I've run it against the USB disk and
> an LV on my AHCI/SATA/md array
> 
> Incidentally, I upgraded the Seagate firmware (model 7200.12 from CC34
> to CC49) and one of the disks went offline shortly after I brought the
> system back up.  To avoid the risk that a bad drive might interfere
> with the SATA performance, I completely removed it before running any
> tests. Tomorrow I'm out to buy some enterprise grade drives, I'm
> thinking about Seagate Constellation SATA or even SAS.
> 
> Anyway, onto the test results:
> 
> USB disk (Seagate  9SD2A3-500 320GB):
> 
> rand-write: (groupid=3, jobs=1): err= 0: pid=22519
>   write: io=46680KB, bw=796512B/s, iops=194, runt= 60012msec
>     slat (usec): min=13, max=25264, avg=106.02, stdev=525.18
>     clat (usec): min=993, max=103568, avg=20444.19, stdev=11622.11
>     bw (KB/s) : min=  521, max= 1224, per=100.06%, avg=777.48,
> stdev=97.07 cpu          : usr=0.73%, sys=2.33%, ctx=12024, majf=0,
> minf=20 IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%,
> 32=0.0%,

Please repeat the test with iodepth=1.

194 IOPS appears highly unrealistic unless NCQ or something like that is
in use, at least if that's a 5400/7200 RPM SATA drive (I didn't check the
vendor information).

iodepth=1 should give you what the hardware is capable without request 
queueing and reordering involved.

> The IOPS scores look similar, but I checked carefully and I'm fairly
> certain the disks were mounted correctly when the tests ran.
> 
> Should I run this tool over NFS, will the results be meaningful?
> 
> Given the need to replace a drive anyway, I'm really thinking about one
> of the following approaches:
> - same controller, upgrade to enterprise SATA drives
> - buy a dedicated SAS/SATA controller, upgrade to enterprise SATA
> drives
> - buy a dedicated SAS/SATA controller, upgrade to SAS drives
> 
> My HP N36L is quite small, one PCIe x16 slot, the internal drive cage
> has an SFF-8087 (mini SAS) plug, so I'm thinking I can grab something
> small like the Adaptec 1405 - will any of these solutions offer a
> definite win with my NFS issues though?

First I would like to understand more closely what your NFS issues are.
Before throwing money at the problem, it's important to understand what the
problem actually is.

Anyway, 15000 RPM SAS drives should give you more IOPS than 7200 RPM SATA 
drives, but SATA drives are cheaper and thus you could - depending on RAID 
level - increase IOPS by just using more drives.

But still, first I'd like to understand *why* it's slow.

What does

iostat -x -d -m 5
vmstat 5

say when exercising the slow (and, for comparison, a faster) setup? See [1].

[1] 
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

(quite some of this should be relevant when reporting with ext4 as well)

As for testing over NFS: I expect the values to drop. NFS has quite some
protocol overhead due to network round trips; in my basic tests, NFSv4 even
more so than NFSv3. For NFS I suggest trying the nfsiostat Python script
from newer nfs-utils. It also shows latencies.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: ext4, barrier, md/RAID1 and write cache
  2012-05-07 22:24             ` Martin Steigerwald
@ 2012-05-07 23:23               ` Daniel Pocock
  2012-05-08 14:55                 ` Martin Steigerwald
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Pocock @ 2012-05-07 23:23 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Andreas Dilger, linux-ext4

On 08/05/12 00:24, Martin Steigerwald wrote:
> On Monday, 7 May 2012, Daniel Pocock wrote:
>   
>> On 07/05/12 20:59, Martin Steigerwald wrote:
>>     
>>> On Monday, 7 May 2012, Daniel Pocock wrote:
>>>       
>>>>> Possibly the older disk is lying about doing cache flushes.  The
>>>>> wonderful disk manufacturers do that with commodity drives to make
>>>>> their benchmark numbers look better.  If you run some random IOPS
>>>>> test against this disk, and it has performance much over 100 IOPS
>>>>> then it is definitely not doing real cache flushes.
>>>>>           
>>> […]
>>>
>>> I think an IOPS benchmark would be better. I.e. something like:
>>>
>>> /usr/share/doc/fio/examples/ssd-test
>>>
>>> (from flexible I/O tester debian package, also included in upstream
>>> tarball of course)
>>>
>>> adapted to your needs.
>>>
>>> Maybe with different iodepth or numjobs (to simulate several threads
>>> generating higher iodepths). With iodepth=1 I have seen 54 IOPS on a
>>> Hitachi 5400 rpm harddisk connected via eSATA.
>>>
>>> Important is direct=1 to bypass the pagecache.
>>>       
>> Thanks for suggesting this tool, I've run it against the USB disk and
>> an LV on my AHCI/SATA/md array
>>
>> Incidentally, I upgraded the Seagate firmware (model 7200.12 from CC34
>> to CC49) and one of the disks went offline shortly after I brought the
>> system back up.  To avoid the risk that a bad drive might interfere
>> with the SATA performance, I completely removed it before running any
>> tests. Tomorrow I'm out to buy some enterprise grade drives, I'm
>> thinking about Seagate Constellation SATA or even SAS.
>>
>> Anyway, onto the test results:
>>
>> USB disk (Seagate  9SD2A3-500 320GB):
>>
>> rand-write: (groupid=3, jobs=1): err= 0: pid=22519
>>   write: io=46680KB, bw=796512B/s, iops=194, runt= 60012msec
>>     slat (usec): min=13, max=25264, avg=106.02, stdev=525.18
>>     clat (usec): min=993, max=103568, avg=20444.19, stdev=11622.11
>>     bw (KB/s) : min=  521, max= 1224, per=100.06%, avg=777.48,
>> stdev=97.07 cpu          : usr=0.73%, sys=2.33%, ctx=12024, majf=0,
>> minf=20 IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%,
>> 32=0.0%,
>>     
> Please repeat the test with iodepth=1.
>   
For the USB device:

rand-write: (groupid=3, jobs=1): err= 0: pid=11855
  write: io=49320KB, bw=841713B/s, iops=205, runt= 60001msec
    slat (usec): min=67, max=6234, avg=112.62, stdev=136.92
    clat (usec): min=684, max=97358, avg=4737.20, stdev=4824.08
    bw (KB/s) : min=  588, max= 1029, per=100.46%, avg=824.74, stdev=84.47
  cpu          : usr=0.64%, sys=2.89%, ctx=12751, majf=0, minf=21
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/12330, short=0/0
     lat (usec): 750=0.02%, 1000=0.48%
     lat (msec): 2=1.05%, 4=66.65%, 10=26.32%, 20=1.46%, 50=3.99%
     lat (msec): 100=0.03%

and for the SATA disk:

rand-write: (groupid=3, jobs=1): err= 0: pid=12256
  write: io=28020KB, bw=478168B/s, iops=116, runt= 60005msec
    slat (usec): min=58, max=132637, avg=110.51, stdev=1623.80
    clat (msec): min=2, max=206, avg= 8.44, stdev= 7.10
    bw (KB/s) : min=   95, max=  566, per=100.24%, avg=467.11, stdev=97.64
  cpu          : usr=0.36%, sys=1.17%, ctx=7196, majf=0, minf=21
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/7005, short=0/0

     lat (msec): 4=6.31%, 10=69.54%, 20=22.68%, 50=0.63%, 100=0.76%
     lat (msec): 250=0.09%



> 194 IOPS appears to be highly unrealistic unless NCQ or something like 
> that is in use. At least if that's a 5400/7200 RPM SATA drive (didn't check 
> vendor information).
>
>   
The SATA disk does have NCQ

USB disk is supposed to be 5400RPM, USB2, but reporting iops=205

SATA disk is 7200 RPM, 3 Gigabit SATA, but reporting iops=116

Does this suggest that the USB disk is caching data but telling Linux
the data is on disk?
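
A back-of-envelope check supports that suspicion. At iodepth=1 with an honoured cache flush, each random write must wait roughly half a rotation plus a seek; the sketch below uses assumed ballpark seek times, not vendor figures:

```python
# Rough ceiling on random-write IOPS for a drive that really commits
# each write to the platter (iodepth=1, write cache flushed every time).
# The average seek times here are assumed ballpark values, not vendor specs.

def max_honest_iops(rpm, avg_seek_ms):
    half_rotation_ms = 0.5 * 60000.0 / rpm  # average rotational latency
    return 1000.0 / (half_rotation_ms + avg_seek_ms)

print(round(max_honest_iops(5400, 12.0)))  # ≈ 57 for a 5400 RPM disk
print(round(max_honest_iops(7200, 8.5)))   # ≈ 79 for a 7200 RPM disk
```

Both figures sit near the 54-116 IOPS seen on disks that flush honestly, and far below the 205 IOPS the USB disk reports.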

>> The IOPS scores look similar, but I checked carefully and I'm fairly
>> certain the disks were mounted correctly when the tests ran.
>>
>> Should I run this tool over NFS, will the results be meaningful?
>>
>> Given the need to replace a drive anyway, I'm really thinking about one
>> of the following approaches:
>> - same controller, upgrade to enterprise SATA drives
>> - buy a dedicated SAS/SATA controller, upgrade to enterprise SATA
>> drives
>> - buy a dedicated SAS/SATA controller, upgrade to SAS drives
>>
>> My HP N36L is quite small, one PCIe x16 slot, the internal drive cage
>> has an SFF-8087 (mini SAS) plug, so I'm thinking I can grab something
>> small like the Adaptec 1405 - will any of these solutions offer a
>> definite win with my NFS issues though?
>>     
> First I would like to understand more closely what your NFS issues are. 
> Before throwing money at the problem it's important to understand what the 
> problem actually is.
>
>   
When I do things like unpacking a large source tarball, iostat reports
throughput to the drive between 500-1000kBytes/second

When I do the same operation onto the USB drive over NFS, I see over
5000kBytes/second - but it appears from the iops test figures that the
USB drive is cheating, so we'll ignore that.

- if I just dd to the SATA drive over NFS (with conv=fsync), I see much
faster speeds
- if I'm logged in to the server, and I unpack the same tarball onto the
same LV, the operation completes at 30MBytes/sec

It is a gigabit network and I think that the performance of the dd
command proves it is not something silly like a cable fault (I have come
across such faults elsewhere though)
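
The arithmetic backs that up: 500 kBytes/sec is a tiny fraction of what a gigabit link can carry, so raw bandwidth cannot be the limiter. A quick sketch (the 10% protocol overhead is an assumption):

```python
# Is 500 kB/s anywhere near gigabit wire speed?
wire_bytes_per_s = 1e9 / 8           # raw gigabit payload ceiling
effective = wire_bytes_per_s * 0.9   # assume ~10% framing/protocol overhead
observed = 500 * 1024                # the slow tarball-unpack case

print(f"link utilisation: {observed / effective:.2%}")  # well under 1%
```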

> Anyway, 15000 RPM SAS drives should give you more IOPS than 7200 RPM SATA 
> drives, but SATA drives are cheaper and thus you could - depending on RAID 
> level - increase IOPS by just using more drives.
>
>   
I was thinking about the large (2TB or 3TB) 7200 RPM SAS or SATA drives
in the Seagate `Constellation' enterprise drive range.  I need more
space anyway, and I need to replace the drive that failed, so I have to
spend some money anyway - I just want to throw it in the right direction
(e.g. buying a drive, or if the cheap on-board SATA controller is a
bottleneck or just extremely unsophisticated, I don't mind getting a
dedicated controller)

For example, if I knew that the controller is simply not suitable with
barriers, NFS, etc and that a $200 RAID card or even a $500 RAID card
will guarantee better performance with my current kernel, I would buy
that.  (However, I do want to use md RAID rather than a proprietary
format, so any RAID card would be in JBOD mode)

> But still, first I'd like to understand *why* it's slow.
>
> What does
>
> iostat -x -d -m 5
> vmstat 5
>
> say when exercising the slow (and probably a faster) setup? See [1].
>
>   
All the iostat output is typically like this:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-23             0.00     0.00    0.20  187.60     0.00     0.81     8.89     2.02   10.79   5.07  95.20
dm-23             0.00     0.00    0.20  189.80     0.00     0.91     9.84     1.95   10.29   4.97  94.48
dm-23             0.00     0.00    0.20  228.60     0.00     1.00     8.92     1.97    8.58   4.10  93.92
dm-23             0.00     0.00    0.20  231.80     0.00     0.98     8.70     1.96    8.49   4.06  94.16
dm-23             0.00     0.00    0.20  229.20     0.00     0.94     8.40     1.92    8.39   4.10  94.08

and vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
...
 0  1      0 6881772 118660 576712    0    0     1  1033  720 1553  0  2 60 38
 0  1      0 6879068 120220 577892    0    0     1   918  793 1595  0  2 56 41
 0  1      0 6876208 122200 578684    0    0     1  1055  767 1731  0  2 67 31
 1  1      0 6873356 124176 579392    0    0     1  1014  742 1688  0  2 66 32
 0  1      0 6870628 126132 579904    0    0     1  1007  753 1683  0  2 66 32

and nfsstat -s -o all -l -Z5

nfs v3 server        total:      319
------------- ------------- --------
nfs v3 server      getattr:        1
nfs v3 server      setattr:      126
nfs v3 server       access:        6
nfs v3 server        write:       61
nfs v3 server       create:       61
nfs v3 server        mkdir:        3
nfs v3 server       commit:       61
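
Read as rates, that 5-second sample is telling: 61 creates (with matching writes and commits) in 5 seconds means each small file costs on the order of 80 ms end to end, since the create/write/commit/setattr exchanges are serialized per file and each must reach stable storage. A rough reading:

```python
# Read the 5-second nfsstat sample above as per-file latency.
interval_ms = 5000
creates = 61               # creates (and matching writes/commits) in the sample

per_file_ms = interval_ms / creates
print(round(per_file_ms))  # → 82 ms spent per small file
```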


> [1] 
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
>   

I've also tested onto btrfs and the performance was equally bad, so it
may not be an ext4 issue

The environment is:
Linux srv1 3.2.0-0.bpo.2-amd64 #1 SMP Mon Apr 23 08:38:01 UTC 2012
x86_64 GNU/Linux
(Debian squeeze)
Kernel NFS v3
HP N36L server, onboard AHCI
 md RAID1 as a 1TB device (/dev/md2)
/dev/md2 is a PV for LVM - no other devices attached

As mentioned before, I've tried with and without write cache.
dmesg reports that ext4 (and btrfs) seem to be happy to accept the
barrier=1 or barrier=0 setting with the drives.
dmesg and hdparm also appear to report accurate information about write
cache status.

> (quite some of this should be relevant when reporting with ext4 as well)
>
> As for testing with NFS: I expect the values to drop. NFS has quite some 
> protocol overhead due to network roundtrips. In my basic tests NFSv4 even 
> more so than NFSv3. As for NFS I suggest trying the nfsiostat python script 
> from newer nfs-utils. It also shows latencies. 
>   

I agree - but 500kBytes/sec is just so much slower than anything I've
seen with any IO device in recent years.  I don't expect to get 90% of
the performance of a local disk, but is getting 30-50% reasonable?


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ext4, barrier, md/RAID1 and write cache
  2012-05-07 23:23               ` Daniel Pocock
@ 2012-05-08 14:55                 ` Martin Steigerwald
  2012-05-08 15:28                   ` Daniel Pocock
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Steigerwald @ 2012-05-08 14:55 UTC (permalink / raw)
  To: Daniel Pocock; +Cc: Martin Steigerwald, Andreas Dilger, linux-ext4

Am Dienstag, 8. Mai 2012 schrieb Daniel Pocock:
> On 08/05/12 00:24, Martin Steigerwald wrote:
> > Am Montag, 7. Mai 2012 schrieb Daniel Pocock:
> >> On 07/05/12 20:59, Martin Steigerwald wrote:
> >>> Am Montag, 7. Mai 2012 schrieb Daniel Pocock:
> >>>>> Possibly the older disk is lying about doing cache flushes.  The
> >>>>> wonderful disk manufacturers do that with commodity drives to make
> >>>>> their benchmark numbers look better.  If you run some random IOPS
> >>>>> test against this disk, and it has performance much over 100 IOPS
> >>>>> then it is definitely not doing real cache flushes.
> >>> 
> >>> […]
> >>> 
> >>> I think an IOPS benchmark would be better. I.e. something like:
> >>> 
> >>> /usr/share/doc/fio/examples/ssd-test
> >>> 
> >>> (from flexible I/O tester debian package, also included in upstream
> >>> tarball of course)
> >>> 
> >>> adapted to your needs.
> >>> 
> >>> Maybe with different iodepth or numjobs (to simulate several threads
> >>> generating higher iodepths). With iodepth=1 I have seen 54 IOPS on a
> >>> Hitachi 5400 rpm harddisk connected via eSATA.
> >>> 
> >>> Important is direct=1 to bypass the pagecache.
> >> 
> >> Thanks for suggesting this tool, I've run it against the USB disk and
> >> an LV on my AHCI/SATA/md array
> >> 
> >> Incidentally, I upgraded the Seagate firmware (model 7200.12 from CC34
> >> to CC49) and one of the disks went offline shortly after I brought the
> >> system back up.  To avoid the risk that a bad drive might interfere
> >> with the SATA performance, I completely removed it before running any
> >> tests. Tomorrow I'm out to buy some enterprise grade drives, I'm
> >> thinking about Seagate Constellation SATA or even SAS.
> >> 
> >> Anyway, onto the test results:
> >> 
> >> USB disk (Seagate  9SD2A3-500 320GB):
> >> 
> >> rand-write: (groupid=3, jobs=1): err= 0: pid=22519
> >> 
> >>   write: io=46680KB, bw=796512B/s, iops=194, runt= 60012msec
> >>   
> >>     slat (usec): min=13, max=25264, avg=106.02, stdev=525.18
> >>     clat (usec): min=993, max=103568, avg=20444.19, stdev=11622.11
> >>     bw (KB/s) : min=  521, max= 1224, per=100.06%, avg=777.48, stdev=97.07
> >>   cpu          : usr=0.73%, sys=2.33%, ctx=12024, majf=0, minf=20
> >>   IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> > 
> > Please repeat the test with iodepth=1.
> 
> For the USB device:
> 
> rand-write: (groupid=3, jobs=1): err= 0: pid=11855
>   write: io=49320KB, bw=841713B/s, iops=205, runt= 60001msec
>     slat (usec): min=67, max=6234, avg=112.62, stdev=136.92
>     clat (usec): min=684, max=97358, avg=4737.20, stdev=4824.08
>     bw (KB/s) : min=  588, max= 1029, per=100.46%, avg=824.74, stdev=84.47
>   cpu          : usr=0.64%, sys=2.89%, ctx=12751, majf=0, minf=21
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w: total=0/12330, short=0/0
>      lat (usec): 750=0.02%, 1000=0.48%
>      lat (msec): 2=1.05%, 4=66.65%, 10=26.32%, 20=1.46%, 50=3.99%
>      lat (msec): 100=0.03%
> 
> and for the SATA disk:
> 
> rand-write: (groupid=3, jobs=1): err= 0: pid=12256
>   write: io=28020KB, bw=478168B/s, iops=116, runt= 60005msec
>     slat (usec): min=58, max=132637, avg=110.51, stdev=1623.80
>     clat (msec): min=2, max=206, avg= 8.44, stdev= 7.10
>     bw (KB/s) : min=   95, max=  566, per=100.24%, avg=467.11, stdev=97.64
>   cpu          : usr=0.36%, sys=1.17%, ctx=7196, majf=0, minf=21
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
[…]
>      issued r/w: total=0/7005, short=0/0
> 
>      lat (msec): 4=6.31%, 10=69.54%, 20=22.68%, 50=0.63%, 100=0.76%
>      lat (msec): 250=0.09%
> 
> > 194 IOPS appears to be highly unrealistic unless NCQ or something like
> > that is in use. At least if that's a 5400/7200 RPM SATA drive (didn't
> > check vendor information).
> 
> The SATA disk does have NCQ
> 
> USB disk is supposed to be 5400RPM, USB2, but reporting iops=205
> 
> SATA disk is 7200 RPM, 3 Gigabit SATA, but reporting iops=116
> 
> Does this suggest that the USB disk is caching data but telling Linux
> the data is on disk?

Looks like it.

Some older values for a 1.5 TB WD Green Disk:

mango:~# fio -readonly -name iops -rw=randread -bs=512  -runtime=100 -iodepth 1 
-filename /dev/sda -ioengine  libaio -direct=1
[...] iops: (groupid=0, jobs=1): err= 0: pid=9939
  read : io=1,859KB, bw=19,031B/s, iops=37, runt=100024msec [...]


mango:~# fio -readonly -name iops -rw=randread -bs=512  -runtime=100 -iodepth 
32 -filename /dev/sda -ioengine  libaio -direct=1
iops: (groupid=0, jobs=1): err= 0: pid=10304
  read : io=2,726KB, bw=27,842B/s, iops=54, runt=100257msec

mango:~# hdparm -I /dev/sda | grep -i queue
        Queue depth: 32
           *    Native Command Queueing (NCQ)

- 1.5 TB Western Digital, WDC WD15EADS-00P8B0
- Pentium 4 with 2.80 GHz
- 4 GB RAM, 32-Bit Linux
- Linux Kernel 2.6.36
- fio 1.38-1

> >> The IOPS scores look similar, but I checked carefully and I'm fairly
> >> certain the disks were mounted correctly when the tests ran.
> >> 
> >> Should I run this tool over NFS, will the results be meaningful?
> >> 
> >> Given the need to replace a drive anyway, I'm really thinking about one
> >> of the following approaches:
> >> - same controller, upgrade to enterprise SATA drives
> >> - buy a dedicated SAS/SATA controller, upgrade to enterprise SATA
> >> drives
> >> - buy a dedicated SAS/SATA controller, upgrade to SAS drives
> >> 
> >> My HP N36L is quite small, one PCIe x16 slot, the internal drive cage
> >> has an SFF-8087 (mini SAS) plug, so I'm thinking I can grab something
> >> small like the Adaptec 1405 - will any of these solutions offer a
> >> definite win with my NFS issues though?
> > 
> > First I would like to understand more closely what your NFS issues are.
> > Before throwing money at the problem it's important to understand what the
> > problem actually is.
> 
> When I do things like unpacking a large source tarball, iostat reports
> throughput to the drive between 500-1000kBytes/second
> 
> When I do the same operation onto the USB drive over NFS, I see over
> 5000kBytes/second - but it appears from the iops test figures that the
> USB drive is cheating, so we'll ignore that.
> 
> - if I just dd to the SATA drive over NFS (with conv=fsync), I see much
> faster speeds

Easy: fewer roundtrips.

Just watch nfsstat -3 while untarring a tarball over NFS to see what I mean.

> - if I'm logged in to the server, and I unpack the same tarball onto the
> same LV, the operation completes at 30MBytes/sec

No network.

That's the LV on the internal disk?

> It is a gigabit network and I think that the performance of the dd
> command proves it is not something silly like a cable fault (I have come
> across such faults elsewhere though)

What is the latency?

> > Anyway, 15000 RPM SAS drives should give you more IOPS than 7200 RPM SATA
> > drives, but SATA drives are cheaper and thus you could - depending on
> > RAID level - increase IOPS by just using more drives.
> 
> I was thinking about the large (2TB or 3TB) 7200 RPM SAS or SATA drives
> in the Seagate `Constellation' enterprise drive range.  I need more
> space anyway, and I need to replace the drive that failed, so I have to
> spend some money anyway - I just want to throw it in the right direction
> (e.g. buying a drive, or if the cheap on-board SATA controller is a
> bottleneck or just extremely unsophisticated, I don't mind getting a
> dedicated controller)
> 
> For example, if I knew that the controller is simply not suitable with
> barriers, NFS, etc and that a $200 RAID card or even a $500 RAID card
> will guarantee better performance with my current kernel, I would buy
> that.  (However, I do want to use md RAID rather than a proprietary
> format, so any RAID card would be in JBOD mode)

The point is: how much of the performance will arrive at NFS? I can't say 
yet.

> > But still, first I'd like to understand *why* it's slow.
> > 
> > What does
> > 
> > iostat -x -d -m 5
> > vmstat 5
> > 
> > say when exercising the slow (and probably a faster) setup? See [1].
> 
> All the iostat output is typically like this:
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> dm-23             0.00     0.00    0.20  187.60     0.00     0.81     8.89     2.02   10.79   5.07  95.20
> dm-23             0.00     0.00    0.20  189.80     0.00     0.91     9.84     1.95   10.29   4.97  94.48
> dm-23             0.00     0.00    0.20  228.60     0.00     1.00     8.92     1.97    8.58   4.10  93.92
> dm-23             0.00     0.00    0.20  231.80     0.00     0.98     8.70     1.96    8.49   4.06  94.16
> dm-23             0.00     0.00    0.20  229.20     0.00     0.94     8.40     1.92    8.39   4.10  94.08

Hmmm, disk looks quite utilized. Are there other I/O workloads on the 
machine?

> and vmstat:
> 
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> ...
>  0  1      0 6881772 118660 576712    0    0     1  1033  720 1553  0  2 60 38
>  0  1      0 6879068 120220 577892    0    0     1   918  793 1595  0  2 56 41
>  0  1      0 6876208 122200 578684    0    0     1  1055  767 1731  0  2 67 31
>  1  1      0 6873356 124176 579392    0    0     1  1014  742 1688  0  2 66 32
>  0  1      0 6870628 126132 579904    0    0     1  1007  753 1683  0  2 66 32

And wait I/O is quite high.

Thus it seems this workload can be faster with faster / more disks or a RAID 
controller with battery (and disabling barriers / cache flushes).

> and nfsstat -s -o all -l -Z5
> 
> nfs v3 server        total:      319
> ------------- ------------- --------
> nfs v3 server      getattr:        1
> nfs v3 server      setattr:      126
> nfs v3 server       access:        6
> nfs v3 server        write:       61
> nfs v3 server       create:       61
> nfs v3 server        mkdir:        3
> nfs v3 server       commit:       61

I would like to see nfsiostat from newer nfs-utils, cause it includes 
latencies.

> > [1]
> > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_whe
> > n_reporting_a_problem.3F
> 
> I've also tested onto btrfs and the performance was equally bad, so it
> may not be an ext4 issue
> 
> The environment is:
> Linux srv1 3.2.0-0.bpo.2-amd64 #1 SMP Mon Apr 23 08:38:01 UTC 2012
> x86_64 GNU/Linux
> (Debian squeeze)
> Kernel NFS v3
> HP N36L server, onboard AHCI
>  md RAID1 as a 1TB device (/dev/md2)
> /dev/md2 is a PV for LVM - no other devices attached
> 
> As mentioned before, I've tried with and without write cache.
> dmesg reports that ext4 (and btrfs) seem to be happy to accept the
> barrier=1 or barrier=0 setting with the drives.

3.2 doesn't report failure on barriers anymore. Barriers have been switched to 
cache flush requests and these will not report back failure. So you have to 
make sure cache flushes work in other ways.
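
One userspace way to sanity-check flushes is to time a loop of small write()+fsync() pairs: on a single 7200 RPM disk with working flushes the result should land near 100 IOPS, and a figure far above that suggests a volatile cache is absorbing them. A rough probe, not a rigorous test (the path is whatever filesystem you want to exercise):

```python
import os
import time

def fsync_iops(path, count=200, block=b"x" * 4096):
    """Time count write+fsync pairs; each fsync should hit stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.monotonic()
        for _ in range(count):
            os.write(fd, block)
            os.fsync(fd)
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return count / elapsed

# e.g. on the filesystem under test:
# print(fsync_iops("/mnt/test/flush-probe"))
```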

> dmesg and hdparm also appear to report accurate information about write
> cache status.
> 
> > (quite some of this should be relevant when reporting with ext4 as well)
> > 
> > As for testing with NFS: I expect the values to drop. NFS has quite some
> > protocol overhead due to network roundtrips. In my basic tests NFSv4 even
> > more so than NFSv3. As for NFS I suggest trying the nfsiostat python script
> > from newer nfs-utils. It also shows latencies.
> 
> I agree - but 500kBytes/sec is just so much slower than anything I've
> seen with any IO device in recent years.  I don't expect to get 90% of
> the performance of a local disk, but is getting 30-50% reasonable?

Depends on the workload.

You might consider using FS-Cache with cachefilesd for local client side 
caching.

Ciao,
-- 
Martin Steigerwald - teamix GmbH - http://www.teamix.de
gpg: 19E3 8D42 896F D004 08AC A0CA 1E10 C593 0399 AE90
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ext4, barrier, md/RAID1 and write cache
  2012-05-08 14:55                 ` Martin Steigerwald
@ 2012-05-08 15:28                   ` Daniel Pocock
  2012-05-08 17:02                     ` Andreas Dilger
  2012-05-09  7:30                     ` Martin Steigerwald
  0 siblings, 2 replies; 14+ messages in thread
From: Daniel Pocock @ 2012-05-08 15:28 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Martin Steigerwald, Andreas Dilger, linux-ext4



On 08/05/12 14:55, Martin Steigerwald wrote:
> Am Dienstag, 8. Mai 2012 schrieb Daniel Pocock:
>> On 08/05/12 00:24, Martin Steigerwald wrote:
>>> Am Montag, 7. Mai 2012 schrieb Daniel Pocock:
>>>> On 07/05/12 20:59, Martin Steigerwald wrote:
>>>>> Am Montag, 7. Mai 2012 schrieb Daniel Pocock:
>>>>>>> Possibly the older disk is lying about doing cache flushes.  The
>>>>>>> wonderful disk manufacturers do that with commodity drives to make
>>>>>>> their benchmark numbers look better.  If you run some random IOPS
>>>>>>> test against this disk, and it has performance much over 100 IOPS
>>>>>>> then it is definitely not doing real cache flushes.
>>>>>
>>>>> […]
>>>>>
>>>>> I think an IOPS benchmark would be better. I.e. something like:
>>>>>
>>>>> /usr/share/doc/fio/examples/ssd-test
>>>>>
>>>>> (from flexible I/O tester debian package, also included in upstream
>>>>> tarball of course)
>>>>>
>>>>> adapted to your needs.
>>>>>
>>>>> Maybe with different iodepth or numjobs (to simulate several threads
>>>>> generating higher iodepths). With iodepth=1 I have seen 54 IOPS on a
>>>>> Hitachi 5400 rpm harddisk connected via eSATA.
>>>>>
>>>>> Important is direct=1 to bypass the pagecache.
>>>>
>>>> Thanks for suggesting this tool, I've run it against the USB disk and
>>>> an LV on my AHCI/SATA/md array
>>>>
>>>> Incidentally, I upgraded the Seagate firmware (model 7200.12 from CC34
>>>> to CC49) and one of the disks went offline shortly after I brought the
>>>> system back up.  To avoid the risk that a bad drive might interfere
>>>> with the SATA performance, I completely removed it before running any
>>>> tests. Tomorrow I'm out to buy some enterprise grade drives, I'm
>>>> thinking about Seagate Constellation SATA or even SAS.
>>>>
>>>> Anyway, onto the test results:
>>>>
>>>> USB disk (Seagate  9SD2A3-500 320GB):
>>>>
>>>> rand-write: (groupid=3, jobs=1): err= 0: pid=22519
>>>>
>>>>   write: io=46680KB, bw=796512B/s, iops=194, runt= 60012msec
>>>>   
>>>>     slat (usec): min=13, max=25264, avg=106.02, stdev=525.18
>>>>     clat (usec): min=993, max=103568, avg=20444.19, stdev=11622.11
>>>>     bw (KB/s) : min=  521, max= 1224, per=100.06%, avg=777.48, stdev=97.07
>>>>   cpu          : usr=0.73%, sys=2.33%, ctx=12024, majf=0, minf=20
>>>>   IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>>
>>> Please repeat the test with iodepth=1.
>>
>> For the USB device:
>>
>> rand-write: (groupid=3, jobs=1): err= 0: pid=11855
>>   write: io=49320KB, bw=841713B/s, iops=205, runt= 60001msec
>>     slat (usec): min=67, max=6234, avg=112.62, stdev=136.92
>>     clat (usec): min=684, max=97358, avg=4737.20, stdev=4824.08
>>     bw (KB/s) : min=  588, max= 1029, per=100.46%, avg=824.74, stdev=84.47
>>   cpu          : usr=0.64%, sys=2.89%, ctx=12751, majf=0, minf=21
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w: total=0/12330, short=0/0
>>      lat (usec): 750=0.02%, 1000=0.48%
>>      lat (msec): 2=1.05%, 4=66.65%, 10=26.32%, 20=1.46%, 50=3.99%
>>      lat (msec): 100=0.03%
>>
>> and for the SATA disk:
>>
>> rand-write: (groupid=3, jobs=1): err= 0: pid=12256
>>   write: io=28020KB, bw=478168B/s, iops=116, runt= 60005msec
>>     slat (usec): min=58, max=132637, avg=110.51, stdev=1623.80
>>     clat (msec): min=2, max=206, avg= 8.44, stdev= 7.10
>>     bw (KB/s) : min=   95, max=  566, per=100.24%, avg=467.11, stdev=97.64
>>   cpu          : usr=0.36%, sys=1.17%, ctx=7196, majf=0, minf=21
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> […]
>>      issued r/w: total=0/7005, short=0/0
>>
>>      lat (msec): 4=6.31%, 10=69.54%, 20=22.68%, 50=0.63%, 100=0.76%
>>      lat (msec): 250=0.09%
>>
>>> 194 IOPS appears to be highly unrealistic unless NCQ or something like
>>> that is in use. At least if that's a 5400/7200 RPM SATA drive (didn't
>>> check vendor information).
>>
>> The SATA disk does have NCQ
>>
>> USB disk is supposed to be 5400RPM, USB2, but reporting iops=205
>>
>> SATA disk is 7200 RPM, 3 Gigabit SATA, but reporting iops=116
>>
>> Does this suggest that the USB disk is caching data but telling Linux
>> the data is on disk?
> 
> Looks like it.
> 
> Some older values for a 1.5 TB WD Green Disk:
> 
> mango:~# fio -readonly -name iops -rw=randread -bs=512  -runtime=100 -iodepth 1 
> -filename /dev/sda -ioengine  libaio -direct=1
> [...] iops: (groupid=0, jobs=1): err= 0: pid=9939
>   read : io=1,859KB, bw=19,031B/s, iops=37, runt=100024msec [...]
> 
> 
> mango:~# fio -readonly -name iops -rw=randread -bs=512  -runtime=100 -iodepth 
> 32 -filename /dev/sda -ioengine  libaio -direct=1
> iops: (groupid=0, jobs=1): err= 0: pid=10304
>   read : io=2,726KB, bw=27,842B/s, iops=54, runt=100257msec
> 
> mango:~# hdparm -I /dev/sda | grep -i queue
>         Queue depth: 32
>            *    Native Command Queueing (NCQ)
> 
> - 1.5 TB Western Digital, WDC WD15EADS-00P8B0
> - Pentium 4 with 2.80 GHz
> - 4 GB RAM, 32-Bit Linux
> - Linux Kernel 2.6.36
> - fio 1.38-1
> 
>>>> The IOPS scores look similar, but I checked carefully and I'm fairly
>>>> certain the disks were mounted correctly when the tests ran.
>>>>
>>>> Should I run this tool over NFS, will the results be meaningful?
>>>>
>>>> Given the need to replace a drive anyway, I'm really thinking about one
>>>> of the following approaches:
>>>> - same controller, upgrade to enterprise SATA drives
>>>> - buy a dedicated SAS/SATA controller, upgrade to enterprise SATA
>>>> drives
>>>> - buy a dedicated SAS/SATA controller, upgrade to SAS drives
>>>>
>>>> My HP N36L is quite small, one PCIe x16 slot, the internal drive cage
>>>> has an SFF-8087 (mini SAS) plug, so I'm thinking I can grab something
>>>> small like the Adaptec 1405 - will any of these solutions offer a
>>>> definite win with my NFS issues though?
>>>
>>> First I would like to understand more closely what your NFS issues are.
>>> Before throwing money at the problem it's important to understand what the
>>> problem actually is.
>>
>> When I do things like unpacking a large source tarball, iostat reports
>> throughput to the drive between 500-1000kBytes/second
>>
>> When I do the same operation onto the USB drive over NFS, I see over
>> 5000kBytes/second - but it appears from the iops test figures that the
>> USB drive is cheating, so we'll ignore that.
>>
>> - if I just dd to the SATA drive over NFS (with conv=fsync), I see much
>> faster speeds
> 
> Easy: fewer roundtrips.
> 
> Just watch nfsstat -3 while untarring a tarball over NFS to see what I mean.
> 
>> - if I'm logged in to the server, and I unpack the same tarball onto the
>> same LV, the operation completes at 30MBytes/sec
> 
> No network.
> 
> That's the LV on the internal disk?


Yes

>> It is a gigabit network and I think that the performance of the dd
>> command proves it is not something silly like a cable fault (I have come
>> across such faults elsewhere though)
> 
> What is the latency?
> 
$ ping -s 1000 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 1000(1028) bytes of data.
1008 bytes from 192.168.1.2: icmp_req=1 ttl=64 time=0.307 ms
1008 bytes from 192.168.1.2: icmp_req=2 ttl=64 time=0.341 ms
1008 bytes from 192.168.1.2: icmp_req=3 ttl=64 time=0.336 ms
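
Combining that RTT with the ~8.4 ms average write completion from the earlier fio run shows the flush, not the wire, dominates each synchronous exchange:

```python
# Which term dominates a synchronous NFS write: network RTT or disk flush?
rtt_ms = 0.34     # round-trip time from the ping above
flush_ms = 8.44   # avg clat from the iodepth=1 fio run on the SATA disk

total = rtt_ms + flush_ms
print(f"disk share of each exchange: {flush_ms / total:.0%}")  # → 96%
```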


>>> Anyway, 15000 RPM SAS drives should give you more IOPS than 7200 RPM SATA
>>> drives, but SATA drives are cheaper and thus you could - depending on
>>> RAID level - increase IOPS by just using more drives.
>>
>> I was thinking about the large (2TB or 3TB) 7200 RPM SAS or SATA drives
>> in the Seagate `Constellation' enterprise drive range.  I need more
>> space anyway, and I need to replace the drive that failed, so I have to
>> spend some money anyway - I just want to throw it in the right direction
>> (e.g. buying a drive, or if the cheap on-board SATA controller is a
>> bottleneck or just extremely unsophisticated, I don't mind getting a
>> dedicated controller)
>>
>> For example, if I knew that the controller is simply not suitable with
>> barriers, NFS, etc and that a $200 RAID card or even a $500 RAID card
>> will guarantee better performance with my current kernel, I would buy
>> that.  (However, I do want to use md RAID rather than a proprietary
>> format, so any RAID card would be in JBOD mode)
> 
> The point is: how much of the performance will arrive at NFS? I can't say 
> yet.

My impression is that the faster performance of the USB disk was a red
herring, and the problem really is just the nature of the NFS protocol
and the way it is stricter about server-side caching (when sync is
enabled) and consequently it needs more iops.

I've turned two more machines (a HP Z800 with SATA disk and a Lenovo
X220 with SSD disk) into NFSv3 servers, repeated the same tests, and
found similar performance on the Z800, but 20x faster on the SSD (which
can support more IOPS)

>>> But still, first I'd like to understand *why* it's slow.
>>>
>>> What does
>>>
>>> iostat -x -d -m 5
>>> vmstat 5
>>>
>>> say when exercising the slow (and probably a faster) setup? See [1].
>>
>> All the iostat output is typically like this:
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
>> dm-23             0.00     0.00    0.20  187.60     0.00     0.81     8.89     2.02   10.79   5.07  95.20
>> dm-23             0.00     0.00    0.20  189.80     0.00     0.91     9.84     1.95   10.29   4.97  94.48
>> dm-23             0.00     0.00    0.20  228.60     0.00     1.00     8.92     1.97    8.58   4.10  93.92
>> dm-23             0.00     0.00    0.20  231.80     0.00     0.98     8.70     1.96    8.49   4.06  94.16
>> dm-23             0.00     0.00    0.20  229.20     0.00     0.94     8.40     1.92    8.39   4.10  94.08
> 
> Hmmm, disk looks quite utilized. Are there other I/O workloads on the 
> machine?

No, just me testing it

>> and vmstat:
>>
>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> ...
>>  0  1      0 6881772 118660 576712    0    0     1  1033  720 1553  0  2 60 38
>>  0  1      0 6879068 120220 577892    0    0     1   918  793 1595  0  2 56 41
>>  0  1      0 6876208 122200 578684    0    0     1  1055  767 1731  0  2 67 31
>>  1  1      0 6873356 124176 579392    0    0     1  1014  742 1688  0  2 66 32
>>  0  1      0 6870628 126132 579904    0    0     1  1007  753 1683  0  2 66 32
> 
> And wait I/O is quite high.
> 
> Thus it seems this workload can be faster with faster / more disks or a RAID 
> controller with battery (and disabling barriers / cache flushes).

You mean barrier=0,data=writeback?  Or just barrier=0,data=ordered?

In theory that sounds good, but in practice I understand it creates some
different problems, eg:

- monitoring the battery, replacing it periodically

- batteries only hold their charge for a few hours, so if there is a power
outage on a Sunday and someone turns on the server on Monday morning after
the battery has died, the cache is empty and the disk is corrupt

- some RAID controllers (e.g. HP SmartArray) insist on writing their
metadata to all volumes - so you become locked in to the RAID vendor.  I
prefer to just use RAID1 or RAID10 with Linux md onto the raw disks.  On
some Adaptec controllers, `JBOD' mode allows md to access the disks
directly, although I haven't verified that yet.

I'm tempted to just put a UPS on the server and enable NFS `async' mode,
and avoid running anything on the server that may cause a crash.

>> and nfsstat -s -o all -l -Z5
>>
>> nfs v3 server        total:      319
>> ------------- ------------- --------
>> nfs v3 server      getattr:        1
>> nfs v3 server      setattr:      126
>> nfs v3 server       access:        6
>> nfs v3 server        write:       61
>> nfs v3 server       create:       61
>> nfs v3 server        mkdir:        3
>> nfs v3 server       commit:       61
> 
> I would like to see nfsiostat from newer nfs-utils, cause it includes 
> latencies.
> 
>>> [1]
>>> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_whe
>>> n_reporting_a_problem.3F
>>
>> I've also tested onto btrfs and the performance was equally bad, so it
>> may not be an ext4 issue
>>
>> The environment is:
>> Linux srv1 3.2.0-0.bpo.2-amd64 #1 SMP Mon Apr 23 08:38:01 UTC 2012
>> x86_64 GNU/Linux
>> (Debian squeeze)
>> Kernel NFS v3
>> HP N36L server, onboard AHCI
>>  md RAID1 as a 1TB device (/dev/md2)
>> /dev/md2 is a PV for LVM - no other devices attached
>>
>> As mentioned before, I've tried with and without write cache.
>> dmesg reports that ext4 (and btrfs) seem to be happy to accept the
>> barrier=1 or barrier=0 setting with the drives.
> 
> 3.2 doesn't report failure on barriers anymore. Barriers have been switched to 
> cache flush requests and these will not report back failure. So you have to 
> make sure cache flushes work in other ways.
> 
>> dmesg and hdparm also appear to report accurate information about write
>> cache status.
>>
>>> (quite some of this should be relevant when reporting with ext4 as well)
>>>
>>> As for testing with NFS: I expect the values to drop. NFS has quite some
>>> protocol overhead due to network roundtrips. In my basic tests NFSv4 even
>>> more so than NFSv3. For NFS I suggest trying the nfsiostat python script
>>> from newer nfs-utils. It also shows latencies.
>>
>> I agree - but 500kBytes/sec is just so much slower than anything I've
>> seen with any IO device in recent years.  I don't expect to get 90% of
>> the performance of a local disk, but is getting 30-50% reasonable?
> 
> Depends on the workload.
> 
> You might consider using FS-Cache with cachefilesd for local client side 
> caching.
> 
> Ciao,
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ext4, barrier, md/RAID1 and write cache
  2012-05-08 15:28                   ` Daniel Pocock
@ 2012-05-08 17:02                     ` Andreas Dilger
  2012-05-09  7:30                     ` Martin Steigerwald
  1 sibling, 0 replies; 14+ messages in thread
From: Andreas Dilger @ 2012-05-08 17:02 UTC (permalink / raw)
  To: Daniel Pocock; +Cc: Martin Steigerwald, Martin Steigerwald, linux-ext4

On 2012-05-08, at 9:28 AM, Daniel Pocock wrote:
> My impression is that the faster performance of the USB disk was a red
> herring, and the problem really is just the nature of the NFS protocol
> and the way it is stricter about server-side caching (when sync is
> enabled) and consequently it needs more iops.
> 
> I've turned two more machines (a HP Z800 with SATA disk and a Lenovo
> X220 with SSD disk) into NFSv3 servers, repeated the same tests, and
> found similar performance on the Z800, but 20x faster on the SSD (which
> can support more IOPS)

Another possible option is to try "-o data=journal" for the ext4
filesystem.  This will, in theory, turn your random IO workload to
the filesystem into a streaming IO workload to the journal.  This
is only useful if the filesystem is not continually busy, and needs
a large enough journal (and enough RAM to match) to handle the burst
IO loads.

For example, if you are writing 1GB of data you need a 4GB journal
size and 4GB of RAM to allow all of the data to burst into the journal
and write into the filesystem asynchronously.  It would also be
interesting to see if there is a benefit from running with an external
journal (possibly on a separate disk or an SSD), because then the
synchronous part of the IO does not seek, and then the small IOs can
be safely written to the filesystem asynchronously (they will be
rewritten from the journal if the server crashes).

Typically, data=journal mode will decrease I/O performance by 1/2,
since all data is written twice, but in your case NFS is hurting the
performance far more than this, so the extra "overhead" may still
give better performance visible to the clients.
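As a sketch, an external journal could be set up roughly like this (all device
names and the mount point are placeholders, and the filesystem must be
unmounted while the journal is changed):

```sh
# all device names are placeholders; run with the filesystem unmounted
mke2fs -O journal_dev -b 4096 /dev/sdc1    # spare partition becomes the journal
tune2fs -O ^has_journal /dev/vg0/data      # drop the internal journal
tune2fs -J device=/dev/sdc1 /dev/vg0/data  # attach the external journal
mount -o data=journal /dev/vg0/data /srv/export
```

The journal device should be created with the same blocksize as the filesystem
it will serve.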

>>> All the iostat output is typically like this:
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
>>> avgrq-sz avgqu-sz   await  svctm  %util
>>> dm-23             0.00     0.00    0.20  187.60     0.00     0.81
>>> 8.89     2.02   10.79   5.07  95.20
>>> dm-23             0.00     0.00    0.20  189.80     0.00     0.91
>>> 9.84     1.95   10.29   4.97  94.48
>>> dm-23             0.00     0.00    0.20  228.60     0.00     1.00
>>> 8.92     1.97    8.58   4.10  93.92
>>> dm-23             0.00     0.00    0.20  231.80     0.00     0.98
>>> 8.70     1.96    8.49   4.06  94.16
>>> dm-23             0.00     0.00    0.20  229.20     0.00     0.94
>>> 8.40     1.92    8.39   4.10  94.08
>> 
>> Hmmm, disk looks quite utilized. Are there other I/O workloads on the 
>> machine?
> 
> No, just me testing it

Looking at these results, the average IO size is very small.  With
around 210 w/s and a write bandwidth of 1 MB/s, the average write
size comes out to only about 4.5 kB.
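That back-of-the-envelope figure can be cross-checked against the iostat
samples quoted above:

```python
# (w/s, wMB/s) pairs taken from the iostat samples quoted above
samples = [(187.60, 0.81), (189.80, 0.91), (228.60, 1.00),
           (231.80, 0.98), (229.20, 0.94)]

writes_per_sec = sum(w for w, _ in samples) / len(samples)
mb_per_sec = sum(mb for _, mb in samples) / len(samples)

# average size of one write request, in KiB
avg_write_kb = mb_per_sec * 1024 / writes_per_sec
print(f"{writes_per_sec:.0f} w/s, {avg_write_kb:.1f} KiB per write")
```

which lands at roughly 213 w/s and about 4.5 KiB per request.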

Cheers, Andreas






^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ext4, barrier, md/RAID1 and write cache
  2012-05-08 15:28                   ` Daniel Pocock
  2012-05-08 17:02                     ` Andreas Dilger
@ 2012-05-09  7:30                     ` Martin Steigerwald
  2012-05-09  9:34                       ` Martin Steigerwald
  1 sibling, 1 reply; 14+ messages in thread
From: Martin Steigerwald @ 2012-05-09  7:30 UTC (permalink / raw)
  To: Daniel Pocock; +Cc: Martin Steigerwald, Andreas Dilger, linux-ext4

Am Dienstag, 8. Mai 2012 schrieb Daniel Pocock:
> On 08/05/12 14:55, Martin Steigerwald wrote:
> > Am Dienstag, 8. Mai 2012 schrieb Daniel Pocock:
> >> On 08/05/12 00:24, Martin Steigerwald wrote:
> >>> Am Montag, 7. Mai 2012 schrieb Daniel Pocock:
> >>>> On 07/05/12 20:59, Martin Steigerwald wrote:
> >>>>> Am Montag, 7. Mai 2012 schrieb Daniel Pocock:
> >>>>>>> Possibly the older disk is lying about doing cache flushes.  The
> >>>>>>> wonderful disk manufacturers do that with commodity drives to make
> >>>>>>> their benchmark numbers look better.  If you run some random IOPS
> >>>>>>> test against this disk, and it has performance much over 100 IOPS
> >>>>>>> then it is definitely not doing real cache flushes.
> >>>>> 
> >>>>> […]
> >>>>> 
> >>>>> I think an IOPS benchmark would be better. I.e. something like:
> >>>>> 
> >>>>> /usr/share/doc/fio/examples/ssd-test
> >>>>> 
> >>>>> (from flexible I/O tester debian package, also included in upstream
> >>>>> tarball of course)
> >>>>> 
> >>>>> adapted to your needs.
> >>>>> 
> >>>>> Maybe with different iodepth or numjobs (to simulate several threads
> >>>>> generating higher iodepths). With iodepth=1 I have seen 54 IOPS on a
> >>>>> Hitachi 5400 rpm harddisk connected via eSATA.
> >>>>> 
> >>>>> Important is direct=1 to bypass the pagecache.
> >>>> 
> >>>> Thanks for suggesting this tool, I've run it against the USB disk and
> >>>> an LV on my AHCI/SATA/md array
> >>>> 
> >>>> Incidentally, I upgraded the Seagate firmware (model 7200.12 from CC34
> >>>> to CC49) and one of the disks went offline shortly after I brought the
> >>>> system back up.  To avoid the risk that a bad drive might interfere
> >>>> with the SATA performance, I completely removed it before running any
> >>>> tests. Tomorrow I'm out to buy some enterprise grade drives, I'm
> >>>> thinking about Seagate Constellation SATA or even SAS.
> >>>> 
> >>>> Anyway, onto the test results:
> >>>> 
> >>>> USB disk (Seagate  9SD2A3-500 320GB):
> >>>> 
> >>>> rand-write: (groupid=3, jobs=1): err= 0: pid=22519
> >>>> 
> >>>>   write: io=46680KB, bw=796512B/s, iops=194, runt= 60012msec
[…]
> >>> Please repeat the test with iodepth=1.
> >> 
> >> For the USB device:
> >> 
> >> rand-write: (groupid=3, jobs=1): err= 0: pid=11855
> >> 
> >>   write: io=49320KB, bw=841713B/s, iops=205, runt= 60001msec
[…]
> >> and for the SATA disk:
> >> 
> >> rand-write: (groupid=3, jobs=1): err= 0: pid=12256
> >> 
> >>   write: io=28020KB, bw=478168B/s, iops=116, runt= 60005msec
[…]
> > […]
> > 
> >>      issued r/w: total=0/7005, short=0/0
> >>      
> >>      lat (msec): 4=6.31%, 10=69.54%, 20=22.68%, 50=0.63%, 100=0.76%
> >>      lat (msec): 250=0.09%
> >>> 
> >>> 194 IOPS appears to be highly unrealistic unless NCQ or something like
> >>> that is in use. At least if that's a 5400/7200 RPM SATA drive (didn't
> >>> check vendor information).
> >> 
> >> The SATA disk does have NCQ
> >> 
> >> USB disk is supposed to be 5400RPM, USB2, but reporting iops=205
> >> 
> >> SATA disk is 7200 RPM, 3 Gigabit SATA, but reporting iops=116
> >> 
> >> Does this suggest that the USB disk is caching data but telling Linux
> >> the data is on disk?
> > 
> > Looks like it.
> > 
> > Some older values for a 1.5 TB WD Green Disk:
> > 
> > mango:~# fio -readonly -name iops -rw=randread -bs=512  -runtime=100
> > -iodepth 1 -filename /dev/sda -ioengine  libaio -direct=1
> > [...] iops: (groupid=0, jobs=1): err= 0: pid=9939
> > 
> >   read : io=1,859KB, bw=19,031B/s, iops=37, runt=100024msec [...]
> > 
> > mango:~# fio -readonly -name iops -rw=randread -bs=512  -runtime=100
> > -iodepth 32 -filename /dev/sda -ioengine  libaio -direct=1
> > iops: (groupid=0, jobs=1): err= 0: pid=10304
> > 
> >   read : io=2,726KB, bw=27,842B/s, iops=54, runt=100257msec
> > 
> > mango:~# hdparm -I /dev/sda | grep -i queue
> > 
> >         Queue depth: 32
> >         
> >            *    Native Command Queueing (NCQ)
> > 
> > - 1.5 TB Western Digital, WDC WD15EADS-00P8B0
> > - Pentium 4 with 2.80 GHz
> > - 4 GB RAM, 32-Bit Linux
> > - Linux Kernel 2.6.36
> > - fio 1.38-1
[…]
> >> It is a gigabit network and I think that the performance of the dd
> >> command proves it is not something silly like a cable fault (I have come
> >> across such faults elsewhere though)
> > 
> > What is the latency?
> 
> $ ping -s 1000 192.168.1.2
> PING 192.168.1.2 (192.168.1.2) 1000(1028) bytes of data.
> 1008 bytes from 192.168.1.2: icmp_req=1 ttl=64 time=0.307 ms
> 1008 bytes from 192.168.1.2: icmp_req=2 ttl=64 time=0.341 ms
> 1008 bytes from 192.168.1.2: icmp_req=3 ttl=64 time=0.336 ms

Seems to be fine.
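For context on the IOPS figures discussed above, the mechanical upper bound for
queue-depth-1 random I/O can be sketched roughly (the average seek times here
are assumed, not measured):

```python
def rough_max_iops(rpm, avg_seek_ms):
    """Rough queue-depth-1 random IOPS bound: one average seek plus
    half a rotation per request (ignores controller/transfer overhead)."""
    rotational_latency_ms = 60000.0 / rpm / 2
    return 1000.0 / (avg_seek_ms + rotational_latency_ms)

print(round(rough_max_iops(7200, 8.5)))   # roughly 80 for a desktop SATA disk
print(round(rough_max_iops(5400, 12.0)))  # roughly 57 for a 5400 RPM disk
```

By this estimate, a 5400 RPM USB disk reporting ~205 IOPS at iodepth=1 is
almost certainly acknowledging writes from its cache rather than from the
platters.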

> >>> Anyway, 15000 RPM SAS drives should give you more IOPS than 7200 RPM
> >>> SATA drives, but SATA drives are cheaper and thus you could -
> >>> depending on RAID level - increase IOPS by just using more drives.
> >> 
> >> I was thinking about the large (2TB or 3TB) 7200 RPM SAS or SATA drives
> >> in the Seagate `Constellation' enterprise drive range.  I need more
> >> space anyway, and I need to replace the drive that failed, so I have to
> >> spend some money anyway - I just want to throw it in the right direction
> >> (e.g. buying a drive, or if the cheap on-board SATA controller is a
> >> bottleneck or just extremely unsophisticated, I don't mind getting a
> >> dedicated controller)
> >> 
> >> For example, if I knew that the controller is simply not suitable with
> >> barriers, NFS, etc and that a $200 RAID card or even a $500 RAID card
> >> will guarantee better performance with my current kernel, I would buy
> >> that.  (However, I do want to use md RAID rather than a proprietary
> >> format, so any RAID card would be in JBOD mode)
> > 
> > The point is: How much of the performance will arrive at NFS? I can't
> > say yet.
> 
> My impression is that the faster performance of the USB disk was a red
> herring, and the problem really is just the nature of the NFS protocol
> and the way it is stricter about server-side caching (when sync is
> enabled) and consequently it needs more iops.

Yes, that seems to be the case here. It seems to be a small blocksize random 
I/O workload with heavy fsync() usage.

You could adapt /usr/share/doc/fio/examples/iometer-file-access-server to 
benchmark such a scenario. fsmark also simulates such a heavy fsync()-based 
workload quite well. I have packaged it for Debian, but it's still in the NEW 
queue. You can grab it from

http://people.teamix.net/~ms/debian/sid/

(32-Bit build, but easily buildable for amd64 as well)
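A minimal fio job file for such a scenario might look like this sketch (the
directory, sizes, and runtime are placeholders, not values from this thread):

```ini
; small-block random writes with an fsync after every write,
; approximating the commit behaviour of a sync NFS export
[global]
directory=/srv/test    ; placeholder path
ioengine=libaio
direct=1
runtime=60
time_based

[fsync-randwrite]
rw=randwrite
bs=4k
size=512m
iodepth=1
fsync=1
```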

> I've turned two more machines (a HP Z800 with SATA disk and a Lenovo
> X220 with SSD disk) into NFSv3 servers, repeated the same tests, and
> found similar performance on the Z800, but 20x faster on the SSD (which
> can support more IOPS)

Okay, then you want more IOPS.

> > And wait I/O is quite high.
> > 
> > Thus it seems this workload can be faster with faster / more disks or a
> > RAID controller with battery (and disabling barriers / cache flushes).
> 
> You mean barrier=0,data=writeback?  Or just barrier=0,data=ordered?

I meant data=ordered. As mentioned by Andreas, data=journal could yield an 
improvement. I'd suggest trying to put the journal onto a different disk then, 
in order to avoid head seeks during writeout of journal data to its final 
location.

> In theory that sounds good, but in practice I understand it creates some
> different problems, eg:
> 
> - monitoring the battery, replacing it periodically
> 
> - batteries only hold the charge for a few hours, so if there is a power
> outage on a Sunday, someone tries to turn on the server on  Monday
> morning and the battery has died, cache is empty and disk is corrupt

Hmmm, from what I know there are NVRAM based controllers that can hold the 
cached data for several days.

> - some RAID controllers (e.g. HP SmartArray) insist on writing their
> metadata to all volumes - so you become locked in to the RAID vendor.  I
> prefer to just use RAID1 or RAID10 with Linux md onto the raw disks.  On
> some Adaptec controllers, `JBOD' mode allows md to access the disks
> directly, although I haven't verified that yet.

I see no reason why SoftRAID cannot be used with an NVRAM-based controller.
 
> I'm tempted to just put a UPS on the server and enable NFS `async' mode,
> and avoid running anything on the server that may cause a crash.

A UPS on the server won't make "async" safe. If the server crashes you still 
can lose data.

Ciao,
-- 
Martin Steigerwald - teamix GmbH - http://www.teamix.de
gpg: 19E3 8D42 896F D004 08AC A0CA 1E10 C593 0399 AE90

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: ext4, barrier, md/RAID1 and write cache
  2012-05-09  7:30                     ` Martin Steigerwald
@ 2012-05-09  9:34                       ` Martin Steigerwald
  0 siblings, 0 replies; 14+ messages in thread
From: Martin Steigerwald @ 2012-05-09  9:34 UTC (permalink / raw)
  To: Daniel Pocock; +Cc: Martin Steigerwald, Andreas Dilger, linux-ext4

Am Mittwoch, 9. Mai 2012 schrieb Martin Steigerwald:
> You could adapt /usr/share/doc/fio/examples/iometer-file-access-server
> to benchmark such a scenario. fsmark also simulates such a heavy
> fsync()-based workload quite well. I have packaged it for Debian, but
> it's still in the NEW queue. You can grab it from
> 
> http://people.teamix.net/~ms/debian/sid/
> 
> (32-Bit build, but easily buildable for amd64 as well)

I have uploaded 64-bit builds of both fsmark and a newer fio 2.0.7, as I needed 
them for my own use. They are built for Wheezy/Sid, but according to the 
dependencies they should work on Squeeze as well.

-- 
Martin Steigerwald - teamix GmbH - http://www.teamix.de
gpg: 19E3 8D42 896F D004 08AC A0CA 1E10 C593 0399 AE90

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-05-09  9:34 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-07 10:47 ext4, barrier, md/RAID1 and write cache Daniel Pocock
2012-05-07 16:25 ` Martin Steigerwald
2012-05-07 16:44   ` Daniel Pocock
2012-05-07 16:54     ` Andreas Dilger
2012-05-07 17:28       ` Daniel Pocock
2012-05-07 18:59         ` Martin Steigerwald
2012-05-07 20:56           ` Daniel Pocock
2012-05-07 22:24             ` Martin Steigerwald
2012-05-07 23:23               ` Daniel Pocock
2012-05-08 14:55                 ` Martin Steigerwald
2012-05-08 15:28                   ` Daniel Pocock
2012-05-08 17:02                     ` Andreas Dilger
2012-05-09  7:30                     ` Martin Steigerwald
2012-05-09  9:34                       ` Martin Steigerwald

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.