From: Martin Steigerwald <ms@teamix.de>
To: Daniel Pocock <daniel@pocock.com.au>
Cc: Martin Steigerwald <Martin@lichtvoll.de>,
	Andreas Dilger <adilger@dilger.ca>,
	linux-ext4@vger.kernel.org
Subject: Re: ext4, barrier, md/RAID1 and write cache
Date: Tue, 8 May 2012 16:55:37 +0200
Message-ID: <201205081655.38146.ms@teamix.de>
In-Reply-To: <4FA85960.6040703@pocock.com.au>

On Tuesday, 8 May 2012, Daniel Pocock wrote:
> On 08/05/12 00:24, Martin Steigerwald wrote:
> > On Monday, 7 May 2012, Daniel Pocock wrote:
> >> On 07/05/12 20:59, Martin Steigerwald wrote:
> >>> On Monday, 7 May 2012, Daniel Pocock wrote:
> >>>>> Possibly the older disk is lying about doing cache flushes.  The
> >>>>> wonderful disk manufacturers do that with commodity drives to make
> >>>>> their benchmark numbers look better.  If you run some random IOPS
> >>>>> test against this disk, and it has performance much over 100 IOPS
> >>>>> then it is definitely not doing real cache flushes.
> >>> 
> >>> […]
> >>> 
> >>> I think an IOPS benchmark would be better. I.e. something like:
> >>> 
> >>> /usr/share/doc/fio/examples/ssd-test
> >>> 
> >>> (from flexible I/O tester debian package, also included in upstream
> >>> tarball of course)
> >>> 
> >>> adapted to your needs.
> >>> 
> >>> Maybe with different iodepth or numjobs (to simulate several threads
> >>> generating higher iodepths). With iodepth=1 I have seen 54 IOPS on a
> >>> Hitachi 5400 rpm harddisk connected via eSATA.
> >>> 
> >>> Important is direct=1 to bypass the pagecache.
> >> 
> >> Thanks for suggesting this tool, I've run it against the USB disk and
> >> an LV on my AHCI/SATA/md array
> >> 
> >> Incidentally, I upgraded the Seagate firmware (model 7200.12 from CC34
> >> to CC49) and one of the disks went offline shortly after I brought the
> >> system back up.  To avoid the risk that a bad drive might interfere
> >> with the SATA performance, I completely removed it before running any
> >> tests. Tomorrow I'm out to buy some enterprise grade drives, I'm
> >> thinking about Seagate Constellation SATA or even SAS.
> >> 
> >> Anyway, onto the test results:
> >> 
> >> USB disk (Seagate  9SD2A3-500 320GB):
> >> 
> >> rand-write: (groupid=3, jobs=1): err= 0: pid=22519
> >> 
> >>   write: io=46680KB, bw=796512B/s, iops=194, runt= 60012msec
> >>   
> >>     slat (usec): min=13, max=25264, avg=106.02, stdev=525.18
> >>     clat (usec): min=993, max=103568, avg=20444.19, stdev=11622.11
> >>     bw (KB/s) : min=  521, max= 1224, per=100.06%, avg=777.48, stdev=97.07
> >>   cpu          : usr=0.73%, sys=2.33%, ctx=12024, majf=0, minf=20
> >>   IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > 
> > Please repeat the test with iodepth=1.
> 
> For the USB device:
> 
> rand-write: (groupid=3, jobs=1): err= 0: pid=11855
>   write: io=49320KB, bw=841713B/s, iops=205, runt= 60001msec
>     slat (usec): min=67, max=6234, avg=112.62, stdev=136.92
>     clat (usec): min=684, max=97358, avg=4737.20, stdev=4824.08
>     bw (KB/s) : min=  588, max= 1029, per=100.46%, avg=824.74, stdev=84.47
>   cpu          : usr=0.64%, sys=2.89%, ctx=12751, majf=0, minf=21
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w: total=0/12330, short=0/0
>      lat (usec): 750=0.02%, 1000=0.48%
>      lat (msec): 2=1.05%, 4=66.65%, 10=26.32%, 20=1.46%, 50=3.99%
>      lat (msec): 100=0.03%
> 
> and for the SATA disk:
> 
> rand-write: (groupid=3, jobs=1): err= 0: pid=12256
>   write: io=28020KB, bw=478168B/s, iops=116, runt= 60005msec
>     slat (usec): min=58, max=132637, avg=110.51, stdev=1623.80
>     clat (msec): min=2, max=206, avg= 8.44, stdev= 7.10
>     bw (KB/s) : min=   95, max=  566, per=100.24%, avg=467.11, stdev=97.64
>   cpu          : usr=0.36%, sys=1.17%, ctx=7196, majf=0, minf=21
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
[…]
>      issued r/w: total=0/7005, short=0/0
> 
>      lat (msec): 4=6.31%, 10=69.54%, 20=22.68%, 50=0.63%, 100=0.76%
>      lat (msec): 250=0.09%
> 
> > 194 IOPS appears to be highly unrealistic unless NCQ or something like
> > that is in use. At least if that's a 5400/7200 RPM SATA drive (didn't
> > check vendor information).
> 
> The SATA disk does have NCQ
> 
> USB disk is supposed to be 5400RPM, USB2, but reporting iops=205
> 
> SATA disk is 7200 RPM, 3 Gigabit SATA, but reporting iops=116
> 
> Does this suggest that the USB disk is caching data but telling Linux
> the data is on disk?

Looks like it.

Some older values for a 1.5 TB WD Green Disk:

mango:~# fio -readonly -name iops -rw=randread -bs=512 -runtime=100 -iodepth 1 -filename /dev/sda -ioengine libaio -direct=1
[...] iops: (groupid=0, jobs=1): err= 0: pid=9939
  read : io=1,859KB, bw=19,031B/s, iops=37, runt=100024msec [...]

mango:~# fio -readonly -name iops -rw=randread -bs=512 -runtime=100 -iodepth 32 -filename /dev/sda -ioengine libaio -direct=1
iops: (groupid=0, jobs=1): err= 0: pid=10304
  read : io=2,726KB, bw=27,842B/s, iops=54, runt=100257msec

mango:~# hdparm -I /dev/sda | grep -i queue
        Queue depth: 32
           *    Native Command Queueing (NCQ)

- 1.5 TB Western Digital, WDC WD15EADS-00P8B0
- Pentium 4 with 2.80 GHz
- 4 GB RAM, 32-Bit Linux
- Linux Kernel 2.6.36
- fio 1.38-1
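
For completeness, roughly the same test as a fio job file (just a sketch;
the device name is a placeholder, adjust bs/runtime/iodepth as needed):

  # iops-test.fio - one outstanding random read against the raw device
  [global]
  ioengine=libaio
  # bypass the page cache
  direct=1
  # a single outstanding I/O, so NCQ cannot help
  iodepth=1
  bs=512
  runtime=100
  time_based
  # placeholder - point this at the disk to test
  filename=/dev/sda

  [iops]
  rw=randread

Run it with "fio --readonly iops-test.fio". The same job with rw=randwrite
(and filename pointing at a scratch file, not the raw device!) gives the
write-side numbers discussed above.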

> >> The IOPS scores look similar, but I checked carefully and I'm fairly
> >> certain the disks were mounted correctly when the tests ran.
> >> 
> >> Should I run this tool over NFS, will the results be meaningful?
> >> 
> >> Given the need to replace a drive anyway, I'm really thinking about one
> >> of the following approaches:
> >> - same controller, upgrade to enterprise SATA drives
> >> - buy a dedicated SAS/SATA controller, upgrade to enterprise SATA
> >> drives
> >> - buy a dedicated SAS/SATA controller, upgrade to SAS drives
> >> 
> >> My HP N36L is quite small, one PCIe x16 slot, the internal drive cage
> >> has an SFF-8087 (mini SAS) plug, so I'm thinking I can grab something
> >> small like the Adaptec 1405 - will any of these solutions offer a
> >> definite win with my NFS issues though?
> > 
> > First I would like to understand more closely what your NFS issues are.
> > Before throwing money at the problem it's important to understand what the
> > problem actually is.
> 
> When I do things like unpacking a large source tarball, iostat reports
> throughput to the drive between 500-1000kBytes/second
> 
> When I do the same operation onto the USB drive over NFS, I see over
> 5000kBytes/second - but it appears from the iops test figures that the
> USB drive is cheating, so we'll ignore that.
> 
> - if I just dd to the SATA drive over NFS (with conv=fsync), I see much
> faster speeds

Easy. Fewer round trips.

Just watch nfsstat -3 while untarring a tarball over NFS to see what I mean.
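
I.e. something along these lines (paths are just examples):

  # on the NFS server: refresh the per-operation counters every 2 seconds
  watch -d -n 2 nfsstat -s -3

  # on the client, in parallel:
  tar xf some-source-tarball.tar.gz -C /mnt/nfs/scratch

Every small file costs several round trips (CREATE, WRITE, COMMIT, SETATTR),
so the counters climb quickly while the payload throughput stays low.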

> - if I'm logged in to the server, and I unpack the same tarball onto the
> same LV, the operation completes at 30MBytes/sec

No network.

That's the LV on the internal disk?

> It is a gigabit network and I think that the performance of the dd
> command proves it is not something silly like a cable fault (I have come
> across such faults elsewhere though)

What is the latency?
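
A quick check (the hostname is just an example):

  ping -c 100 srv1 | tail -2

The min/avg/max round-trip times matter here, since every synchronous NFS
operation pays at least one such round trip.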

> > Anyway, 15000 RPM SAS drives should give you more IOPS than 7200 RPM SATA
> > drives, but SATA drives are cheaper and thus you could - depending on
> > RAID level - increase IOPS by just using more drives.
> 
> I was thinking about the large (2TB or 3TB) 7200 RPM SAS or SATA drives
> in the Seagate `Constellation' enterprise drive range.  I need more
> space anyway, and I need to replace the drive that failed, so I have to
> spend some money anyway - I just want to throw it in the right direction
> (e.g. buying a drive, or if the cheap on-board SATA controller is a
> bottleneck or just extremely unsophisticated, I don't mind getting a
> dedicated controller)
> 
> For example, if I knew that the controller is simply not suitable with
> barriers, NFS, etc and that a $200 RAID card or even a $500 RAID card
> will guarantee better performance with my current kernel, I would buy
> that.  (However, I do want to use md RAID rather than a proprietary
> format, so any RAID card would be in JBOD mode)

The point is: how much of that performance will arrive at NFS? I can't say 
yet.

> > But still, first I'd like to understand *why* it's slow.
> > 
> > What does
> > 
> > iostat -x -d -m 5
> > vmstat 5
> > 
> > say when exercising the slow (and probably a faster) setup? See [1].
> 
> All the iostat output is typically like this:
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> dm-23             0.00     0.00    0.20  187.60     0.00     0.81     8.89     2.02   10.79   5.07  95.20
> dm-23             0.00     0.00    0.20  189.80     0.00     0.91     9.84     1.95   10.29   4.97  94.48
> dm-23             0.00     0.00    0.20  228.60     0.00     1.00     8.92     1.97    8.58   4.10  93.92
> dm-23             0.00     0.00    0.20  231.80     0.00     0.98     8.70     1.96    8.49   4.06  94.16
> dm-23             0.00     0.00    0.20  229.20     0.00     0.94     8.40     1.92    8.39   4.10  94.08

Hmmm, the disk looks quite utilized. Are there other I/O workloads on the 
machine?
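
If in doubt, something like iotop or pidstat shows who is generating the I/O
(both are separate packages, so this assumes they are installed):

  # only list processes/threads that are currently doing I/O
  iotop -o

  # or per-process I/O statistics every 5 seconds (from sysstat)
  pidstat -d 5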

> and vmstat:
> 
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> ...
>  0  1      0 6881772 118660 576712    0    0     1  1033  720 1553  0  2 60 38
>  0  1      0 6879068 120220 577892    0    0     1   918  793 1595  0  2 56 41
>  0  1      0 6876208 122200 578684    0    0     1  1055  767 1731  0  2 67 31
>  1  1      0 6873356 124176 579392    0    0     1  1014  742 1688  0  2 66 32
>  0  1      0 6870628 126132 579904    0    0     1  1007  753 1683  0  2 66 32

And I/O wait is quite high.

Thus it seems this workload could be made faster with faster or more disks, 
or with a battery-backed RAID controller (and barriers / cache flushes 
disabled).
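
To be explicit, that would look like this - and it is only safe when the
controller cache is battery backed or otherwise non-volatile (the mount
point is just an example):

  # ext4: disable barriers / cache flushes for this filesystem
  mount -o remount,barrier=0 /srv/export

  # back to the safe default
  mount -o remount,barrier=1 /srv/export

Without such a cache a power loss can corrupt the filesystem, so don't do
this on plain disks with a volatile write cache.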

> and nfsstat -s -o all -l -Z5
> 
> nfs v3 server        total:      319
> ------------- ------------- --------
> nfs v3 server      getattr:        1
> nfs v3 server      setattr:      126
> nfs v3 server       access:        6
> nfs v3 server        write:       61
> nfs v3 server       create:       61
> nfs v3 server        mkdir:        3
> nfs v3 server       commit:       61

I would like to see nfsiostat from newer nfs-utils, because it includes 
latencies.
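
On the client that looks roughly like this (the mount point is just an
example):

  # per-mount NFS statistics every 5 seconds, with avg RTT and avg exe
  # times per operation type
  nfsiostat 5 /mnt/nfs

avg RTT is roughly the time a request spends on the wire and in the server;
avg exe additionally includes queueing on the client.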

> > [1]
> > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> 
> I've also tested on btrfs and the performance was equally bad, so it
> may not be an ext4 issue
> 
> The environment is:
> Linux srv1 3.2.0-0.bpo.2-amd64 #1 SMP Mon Apr 23 08:38:01 UTC 2012
> x86_64 GNU/Linux
> (Debian squeeze)
> Kernel NFS v3
> HP N36L server, onboard AHCI
>  md RAID1 as a 1TB device (/dev/md2)
> /dev/md2 is a PV for LVM - no other devices attached
> 
> As mentioned before, I've tried with and without write cache.
> dmesg reports that ext4 (and btrfs) seem to be happy to accept the
> barrier=1 or barrier=0 setting with the drives.

3.2 doesn't report barrier failures anymore. Barriers have been replaced by 
cache flush requests, and those do not report failure back. So you have to 
verify in some other way that cache flushes actually work.
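
Two rough checks (the device name is just an example):

  # query the drive's volatile write cache setting
  hdparm -W /dev/sda

  # repeat the iodepth=1 random write test with the cache switched off ...
  hdparm -W0 /dev/sda
  # ... and switch it back on afterwards
  hdparm -W1 /dev/sda

With the write cache off every write has to hit the platters, so a 5400 or
7200 RPM disk cannot honestly sustain much more than about 100 IOPS. If the
numbers stay suspiciously high, the device acknowledges writes it has not
stored yet.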

> dmesg and hdparm also appear to report accurate information about write
> cache status.
> 
> > (quite some of this should be relevant when reporting with ext4 as well)
> > 
> > As for testing with NFS: I expect the values to drop. NFS has quite some
> > protocol overhead due to network round trips. In my basic tests NFSv4 even
> > more so than NFSv3. As for NFS I suggest trying nfsiostat python script
> > from newer nfs-utils. It also shows latencies.
> 
> I agree - but 500kBytes/sec is just so much slower than anything I've
> seen with any IO device in recent years.  I don't expect to get 90% of
> the performance of a local disk, but is getting 30-50% reasonable?

Depends on the workload.

You might consider using FS-Cache with cachefilesd for local client-side 
caching.
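
Roughly (server name, export and mount point are just examples):

  # client side: cachefilesd needs a local cache directory, see
  # /etc/cachefilesd.conf (typically "dir /var/cache/fscache"), then start
  # the service (on Debian e.g. /etc/init.d/cachefilesd start)

  # mount the export with the fsc option to enable FS-Cache for it
  mount -t nfs -o fsc srv1:/export /mnt/nfs

That only helps re-reads, though - it does not speed up the small
synchronous writes an untar generates.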

Ciao,
-- 
Martin Steigerwald - teamix GmbH - http://www.teamix.de
gpg: 19E3 8D42 896F D004 08AC A0CA 1E10 C593 0399 AE90
