* Disk slows down while emptying write buffer
@ 2019-09-14 16:22 Elliott Balsley
  2019-09-15  6:42 ` Sitsofe Wheeler
  0 siblings, 1 reply; 7+ messages in thread
From: Elliott Balsley @ 2019-09-14 16:22 UTC (permalink / raw)
  To: fio

I am trying to figure out why end_fsync causes the overall disk write
to take longer.  I'm aware that without this option, the test may
return before all data is flushed to disk, but I am monitoring actual
disk writes with iostat.  At the beginning of the test, iostat shows
165MBps writing to each disk, and fio shows over 3000MBps (because
it's writing to the buffer, and the buffer is simultaneously going to
disk).
Near the end, fio drops to 0MBps which means it's waiting for fsync to
finish.  At this time, iostat drops to 120MBps per disk.

fio --name=w --rw=write --bs=1M --size=30G --directory=/mnt/stripe --end_fsync=1

If I instead run this command with end_fsync=0 then iostat shows
steady 165MBps the whole time.

I have observed this issue on CentOS with an mdadm soft RAID with XFS.
And also on FreeBSD with a ZFS pool.  If I instead run it on a single
disk formatted XFS then it does not use the cache at all, it just
writes steady at 165MBps, I'm not sure why that is.

I'd appreciate any insights into this.
-Elliott



* Re: Disk slows down while emptying write buffer
  2019-09-14 16:22 Disk slows down while emptying write buffer Elliott Balsley
@ 2019-09-15  6:42 ` Sitsofe Wheeler
  2019-09-15 22:10   ` Elliott Balsley
  0 siblings, 1 reply; 7+ messages in thread
From: Sitsofe Wheeler @ 2019-09-15  6:42 UTC (permalink / raw)
  To: Elliott Balsley; +Cc: fio

Hi,

On Sat, 14 Sep 2019 at 22:18, Elliott Balsley <elliott@altsystems.com> wrote:
>
> I am trying to figure out why end_fsync causes the overall disk write
> to take longer.  I'm aware that without this option, the test may
> return before all data is flushed to disk, but I am monitoring actual
> disk writes with iostat.  At the beginning of the test, iostat shows
> 165MBps writing to each disk, and fio shows over 3000MBps (because
> it's writing to the buffer, and the buffer is simultaneously going to
> disk).
> Near the end, fio drops to 0MBps which means it's waiting for fsync to
> finish.  At this time, iostat drops to 120MBps per disk.
>
> fio --name=w --rw=write --bs=1M --size=30G --directory=/mnt/stripe --end_fsync=1
>
> If I instead run this command with end_fsync=0 then iostat shows
> steady 165MBps the whole time.
>
> I have observed this issue on CentOS with an mdadm soft RAID with XFS.
> And also on FreeBSD with a ZFS pool.  If I instead run it on a single
> disk formatted XFS then it does not use the cache at all, it just
> writes steady at 165MBps, I'm not sure why that is.

Hmm, I doubt this is an fio-specific question; it likely applies to
I/O through filesystems in general.

When you're working with RAID-like devices you have to get
acknowledgement from ALL disks in the RAID to know data has reached
non-volatile storage (due to striping). As fsync also ensures metadata
makes it down to non-volatile storage (in addition to regular data),
it could be that you end up waiting on metadata to be flushed in
between sending data (this is a wild guess), thus resulting in a
slower speed (the fsync might effectively be done as a drain plus
flush). Another wild idea is that the filesystem now prioritises
getting smaller chunks of data to disk as soon as possible over
throughput. You would likely have to talk to a filesystem developer
(e.g. via http://vger.kernel.org/vger-lists.html#linux-fsdevel ) or
otherwise trace what is happening at the filesystem level to know
anything for sure though... If you do find out a definitive answer
please let us know :-)
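
One (hedged) way to get a first look is block-layer tracing with
blktrace while the job runs - /dev/md0 below is only a placeholder for
your mdadm array and the invocation may need adjusting for your setup:

  # run this while the fio job is active; needs the blktrace package
  blktrace -d /dev/md0 -o - | blkparse -i -

Flush requests show up with an F in blkparse's RWBS column, so you can
at least see whether the end_fsync phase is dominated by cache-flush
traffic rather than ordinary writes.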

You can (somewhat) rule out a given layer by doing writes straight to
a block device (THIS WILL DESTROY ANY FILESYSTEM AND DATA ON THE
DEVICE) and seeing if you get the same behavior. Another idea is to
see if things change when you use fdatasync
(https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-fdatasync
) with a number very close to the total number of I/Os (30720) you are
doing, to see whether it's metadata related on Linux (I don't think
FreeBSD supports that operation and fio just falls back to doing
fsync).
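
As a concrete (hedged) example of the fdatasync idea, reusing the
parameters from your original command line:

  fio --name=w --rw=write --bs=1M --size=30G --directory=/mnt/stripe \
      --fdatasync=30719

If that finishes at full speed while end_fsync=1 is slow, it would
suggest the extra cost is in flushing metadata rather than data.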

> disk formatted XFS then it does not use the cache at all, it just

I'd be surprised if it bypassed the cache entirely; you would likely
have to watch what was happening in /proc/meminfo while your test was
running to determine this. It's also worth monitoring the utilization
of your disks (which you will see when you add -x to the iostat
command) to see whether the system thinks there's spare I/O capacity
or whether they are being pushed to their limit.
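
For the Linux side, something like this alongside the job is usually
enough (hedged: the flags are for sysstat's iostat, and FreeBSD's
iostat takes different options):

  iostat -dxm 1
  watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'

The %util column tells you whether the disks are saturated during the
fsync phase, and the Dirty/Writeback lines show how much data is still
waiting in the page cache.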

-- 
Sitsofe | http://sucs.org/~sits/



* Re: Disk slows down while emptying write buffer
  2019-09-15  6:42 ` Sitsofe Wheeler
@ 2019-09-15 22:10   ` Elliott Balsley
  2019-09-16  6:32     ` Sitsofe Wheeler
  0 siblings, 1 reply; 7+ messages in thread
From: Elliott Balsley @ 2019-09-15 22:10 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: fio

Thanks for your suggestions.  Unfortunately, now I'm not confident in
these results, because today this behavior is not happening, even with
the same zpool on the same system.  Now it's writing at a steady rate
to disks the whole time, instead of starting out very high (3GBps to
memory) and then dropping to zero at the end.  So essentially the
fsync doesn't have to sync anything at the end.  Is there some way to
control this behavior?  I'm not sure if it's fio or the filesystem,
but why does it write at disk speed today, while it wrote at memory
speed before?



* Re: Disk slows down while emptying write buffer
  2019-09-15 22:10   ` Elliott Balsley
@ 2019-09-16  6:32     ` Sitsofe Wheeler
  2019-10-02  0:40       ` Elliott Balsley
  0 siblings, 1 reply; 7+ messages in thread
From: Sitsofe Wheeler @ 2019-09-16  6:32 UTC (permalink / raw)
  To: Elliott Balsley; +Cc: fio

Hi,

On Sun, 15 Sep 2019 at 23:10, Elliott Balsley <elliott@altsystems.com> wrote:
>
> Thanks for your suggestions.  Unfortunately, now I'm not confident in
> these results, because today this behavior is not happening, even with
> the same zpool on the same system.  Now it's writing at a steady rate
> to disks the whole time, instead of starting out very high (3GBps to
> memory) and then dropping to zero at the end.  So essentially the
> fsync doesn't have to sync anything at the end.  Is there some way to
> control this behavior?  I'm not sure if it's fio or the filesystem,
> but why does it write at disk speed today, while it wrote at memory
> speed before?

It's unlikely to be fio unless you're changing your job file (which
might mean we're not the best list to get an answer from). Maybe it's
because you're now overwriting a fully written file - does removing
the file between fio runs make any difference? You didn't post any of
your fio output so there's very little for us to go on - did you know
fio on Linux will try to show you averaged disk stats when it
finishes? We also don't know what kernel you're using, etc.

Maybe memory pressure on your system is different to what it was
previously? There are controls like
/proc/sys/vm/dirty_background_ratio,
/proc/sys/vm/dirty_expire_centisecs, etc. (see
https://www.kernel.org/doc/Documentation/sysctl/vm.txt for the full
list) that control when writeback starts, and it could be you're in a
scenario where the system believes it has to start writeback early
(e.g. you're running a program that chews up huge amounts of memory).
Maybe /proc/meminfo will offer some clues?
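
For example (hedged - just a quick way to see the current writeback
thresholds; the values will be whatever your distribution defaults to):

  sysctl vm.dirty_ratio vm.dirty_background_ratio \
         vm.dirty_expire_centisecs vm.dirty_writeback_centisecs

A larger vm.dirty_ratio / vm.dirty_background_ratio lets more data sit
in the page cache before writeback kicks in, which is the sort of thing
that could reproduce the "fast, then stall at fsync" pattern you saw
the first time round.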

Again my recommendation would be to do I/O against the raw block
device representing the RAID (or an individual disk), understanding
that it will destroy any data already there if you do writes. At least
then you will be able to say whether it's an issue with going through
a filesystem or whether it's something from the block layer.

-- 
Sitsofe | http://sucs.org/~sits/



* Re: Disk slows down while emptying write buffer
  2019-09-16  6:32     ` Sitsofe Wheeler
@ 2019-10-02  0:40       ` Elliott Balsley
  2019-10-02 20:12         ` Sitsofe Wheeler
  0 siblings, 1 reply; 7+ messages in thread
From: Elliott Balsley @ 2019-10-02  0:40 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: fio

> Again my recommendation would be to do I/O against the raw block
> device representing the RAID (or an individual disk)

I didn't know fio could write to a block device.  Or do you mean with dd?

I'm still investigating the original question, but now I notice
something odd about fio.  When writing a single file, I would expect
end_fsync to function the same way as fsync_on_close.  But the latter
option displays a much higher bandwidth at the end of the test.  In
both cases, iostat activity is relatively constant the entire time,
about 230MBps.  In both cases, fio displays the correct speed during
the test, and then shows 0KBps during fsync at the end.  It seems like
these two options have a different way of averaging the write and sync
speeds.  Full output of both commands below.  I'm running on FreeBSD
(FreeNAS 11.2), writing to a single disk vdev on ZFS.

root@freenas ~ # fio --name=larry --rw=write --bs=1M --size=10G
--directory=/single --fsync_on_close=1
larry: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB,
(T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.5
Starting 1 process
Jobs: 1 (f=1): [f(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
larry: (groupid=0, jobs=1): err= 0: pid=17944: Tue Oct  1 17:30:20 2019
  write: IOPS=372, BW=372MiB/s (390MB/s)(10.0GiB/27501msec)
    clat (usec): min=236, max=245025, avg=2662.75, stdev=3163.75
     lat (usec): min=246, max=245057, avg=2683.97, stdev=3167.04
    clat percentiles (usec):
     |  1.00th=[  243],  5.00th=[  247], 10.00th=[  249], 20.00th=[  258],
     | 30.00th=[  766], 40.00th=[  799], 50.00th=[ 3785], 60.00th=[ 4047],
     | 70.00th=[ 4293], 80.00th=[ 4490], 90.00th=[ 4752], 95.00th=[ 4948],
     | 99.00th=[ 5407], 99.50th=[ 5669], 99.90th=[12518], 99.95th=[20841],
     | 99.99th=[38011]
   bw (  KiB/s): min=159244, max=3863272, per=100.00%, avg=381324.09,
stdev=584042.66, samples=54
   iops        : min=  155, max= 3772, avg=372.00, stdev=570.30, samples=54
  lat (usec)   : 250=12.21%, 500=12.47%, 750=2.94%, 1000=16.71%
  lat (msec)   : 2=0.41%, 4=12.44%, 10=42.67%, 20=0.09%, 50=0.06%
  lat (msec)   : 250=0.01%
  cpu          : usr=0.92%, sys=10.81%, ctx=67434, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=372MiB/s (390MB/s), 372MiB/s-372MiB/s (390MB/s-390MB/s),
io=10.0GiB (10.7GB), run=27501-27501msec
root@freenas ~ # fio --name=larry --rw=write --bs=1M --size=10G
--directory=/single --end_fsync=1
larry: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB,
(T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.5
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
larry: (groupid=0, jobs=1): err= 0: pid=18136: Tue Oct  1 17:31:34 2019
  write: IOPS=158, BW=158MiB/s (166MB/s)(10.0GiB/64645msec)
    clat (usec): min=238, max=273565, avg=2680.80, stdev=3750.63
     lat (usec): min=246, max=273580, avg=2701.70, stdev=3753.46
    clat percentiles (usec):
     |  1.00th=[   243],  5.00th=[   247], 10.00th=[   249], 20.00th=[   258],
     | 30.00th=[   766], 40.00th=[   799], 50.00th=[  3752], 60.00th=[  4047],
     | 70.00th=[  4293], 80.00th=[  4490], 90.00th=[  4752], 95.00th=[  4948],
     | 99.00th=[  5342], 99.50th=[  5604], 99.90th=[ 13566], 99.95th=[ 29492],
     | 99.99th=[164627]
   bw (  KiB/s): min=53141, max=3875850, per=100.00%, avg=376190.78,
stdev=577953.67, samples=55
   iops        : min=   51, max= 3785, avg=366.98, stdev=564.45, samples=55
  lat (usec)   : 250=11.87%, 500=12.82%, 750=2.62%, 1000=16.95%
  lat (msec)   : 2=0.41%, 4=13.14%, 10=42.04%, 20=0.07%, 50=0.06%
  lat (msec)   : 250=0.01%, 500=0.01%
  cpu          : usr=0.36%, sys=5.34%, ctx=100156, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10240,0,1 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=158MiB/s (166MB/s), 158MiB/s-158MiB/s (166MB/s-166MB/s),
io=10.0GiB (10.7GB), run=64645-64645msec



* Re: Disk slows down while emptying write buffer
  2019-10-02  0:40       ` Elliott Balsley
@ 2019-10-02 20:12         ` Sitsofe Wheeler
  2019-10-02 23:55           ` Elliott Balsley
  0 siblings, 1 reply; 7+ messages in thread
From: Sitsofe Wheeler @ 2019-10-02 20:12 UTC (permalink / raw)
  To: Elliott Balsley; +Cc: fio

On Wed, 2 Oct 2019 at 01:40, Elliott Balsley <elliott@altsystems.com> wrote:
>
> > Again my recommendation would be to do I/O against the raw block
> > device representing the RAID (or an individual disk)
>
> I didn't know fio could write to a block device.  Or do you mean with dd?

fio can definitely use block devices when you run it with permissions
to access them :-)
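
For example (hedged - /dev/da1 is only a placeholder, and writing to
it will destroy whatever is on that device):

  fio --name=raw --rw=write --bs=1M --size=10G --filename=/dev/da1 \
      --end_fsync=1

On FreeBSD that would be a da/ada device; on the Linux box you could
point it at the md device itself or at a single member disk.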

> I'm still investigating the original question, but now I notice
> something odd about fio.  When writing a single file, I would expect
> end_fsync to function the same way as fsync_on_close.  But the latter
> option displays a much higher bandwidth at the end of the test.  In
> both cases, iostat activity is relatively constant the entire time,
> about 230MBps.  In both cases, fio displays the correct speed during
> the test, and then shows 0KBps during fsync at the end.  It seems like
> these two options have a different way of averaging the write and sync
> speeds.  Full output of both commands below.  I'm running on FreeBSD
> (FreeNAS 11.2), writing to a single disk vdev on ZFS.

Looking through the code shows the fsync happens at different points:

fsync_on_close:
#0  put_file (td=0x7fffd430d000, f=0x7ffff46e9510) at filesetup.c:1707
#1  0x00000000004254aa in td_io_close_file (td=0x7fffd430d000,
f=0x7ffff46e9510) at ioengines.c:573
#2  0x0000000000440c74 in close_and_free_files (td=0x7fffd430d000) at
filesetup.c:1405
#3  0x0000000000485d94 in thread_main (data=0xffc0c0) at backend.c:1908
#4  0x00007ffff6b676db in start_thread (arg=0x7fffc12e6700) at
pthread_create.c:463
#5  0x00007ffff647488f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

end_fsync:
#0  do_io (td=0x7fffd430d000, bytes_done=0x7fffc12e5e50) at backend.c:1131
#1  0x0000000000485891 in thread_main (data=0xffc0c0) at backend.c:1788
#2  0x00007ffff6b676db in start_thread (arg=0x7fffc12e6700) at
pthread_create.c:463
#3  0x00007ffff647488f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I would have expected fsync_on_close to potentially be the slower of
the two because it triggers the fsync somewhat later, but that's not
the case below. Maybe the total run time is taken too soon in the
fsync_on_close case
(https://github.com/axboe/fio/blob/fio-3.15/backend.c#L1878 ) because
the fsync doesn't occur until line 1908...
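
As a toy illustration of that hypothesis (this is NOT fio's code, just
stub functions showing how including or excluding the flush in the
recorded runtime changes the reported bandwidth):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* stand-ins for the real work: buffered writes are fast, the flush is slow */
static void write_all_data(void) { usleep(100 * 1000); }
static void flush_to_disk(void)  { usleep(300 * 1000); }

int main(void)
{
    double bytes = 10.0 * 1024 * 1024 * 1024;   /* pretend 10GiB job */
    double t0, rt_end_fsync, rt_fsync_on_close;

    /* end_fsync=1: the fsync is issued from do_io(), inside the timed region */
    t0 = now_sec();
    write_all_data();
    flush_to_disk();
    rt_end_fsync = now_sec() - t0;

    /* fsync_on_close=1: the runtime is captured first; the close-time fsync
     * (put_file via close_and_free_files) lands outside the timed region */
    t0 = now_sec();
    write_all_data();
    rt_fsync_on_close = now_sec() - t0;
    flush_to_disk();

    printf("end_fsync      : %6.0f MiB/s\n", bytes / rt_end_fsync / (1 << 20));
    printf("fsync_on_close : %6.0f MiB/s\n", bytes / rt_fsync_on_close / (1 << 20));
    return 0;
}

The numbers are obviously made up, but the second figure comes out much
higher simply because the flush isn't inside the measured window, which
matches the 372MiB/s vs 158MiB/s difference in your output.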

>
> root@freenas ~ # fio --name=larry --rw=write --bs=1M --size=10G
> --directory=/single --fsync_on_close=1
> larry: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB,
> (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
> fio-3.5
> Starting 1 process
> Jobs: 1 (f=1): [f(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
> larry: (groupid=0, jobs=1): err= 0: pid=17944: Tue Oct  1 17:30:20 2019
>   write: IOPS=372, BW=372MiB/s (390MB/s)(10.0GiB/27501msec)
>     clat (usec): min=236, max=245025, avg=2662.75, stdev=3163.75
>      lat (usec): min=246, max=245057, avg=2683.97, stdev=3167.04
>     clat percentiles (usec):
>      |  1.00th=[  243],  5.00th=[  247], 10.00th=[  249], 20.00th=[  258],
>      | 30.00th=[  766], 40.00th=[  799], 50.00th=[ 3785], 60.00th=[ 4047],
>      | 70.00th=[ 4293], 80.00th=[ 4490], 90.00th=[ 4752], 95.00th=[ 4948],
>      | 99.00th=[ 5407], 99.50th=[ 5669], 99.90th=[12518], 99.95th=[20841],
>      | 99.99th=[38011]
>    bw (  KiB/s): min=159244, max=3863272, per=100.00%, avg=381324.09,
> stdev=584042.66, samples=54
>    iops        : min=  155, max= 3772, avg=372.00, stdev=570.30, samples=54
>   lat (usec)   : 250=12.21%, 500=12.47%, 750=2.94%, 1000=16.71%
>   lat (msec)   : 2=0.41%, 4=12.44%, 10=42.67%, 20=0.09%, 50=0.06%
>   lat (msec)   : 250=0.01%
>   cpu          : usr=0.92%, sys=10.81%, ctx=67434, majf=0, minf=0
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,10240,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   WRITE: bw=372MiB/s (390MB/s), 372MiB/s-372MiB/s (390MB/s-390MB/s),
> io=10.0GiB (10.7GB), run=27501-27501msec
> root@freenas ~ # fio --name=larry --rw=write --bs=1M --size=10G
> --directory=/single --end_fsync=1
> larry: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB,
> (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
> fio-3.5
> Starting 1 process
> Jobs: 1 (f=1): [F(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
> larry: (groupid=0, jobs=1): err= 0: pid=18136: Tue Oct  1 17:31:34 2019
>   write: IOPS=158, BW=158MiB/s (166MB/s)(10.0GiB/64645msec)
>     clat (usec): min=238, max=273565, avg=2680.80, stdev=3750.63
>      lat (usec): min=246, max=273580, avg=2701.70, stdev=3753.46
>     clat percentiles (usec):
>      |  1.00th=[   243],  5.00th=[   247], 10.00th=[   249], 20.00th=[   258],
>      | 30.00th=[   766], 40.00th=[   799], 50.00th=[  3752], 60.00th=[  4047],
>      | 70.00th=[  4293], 80.00th=[  4490], 90.00th=[  4752], 95.00th=[  4948],
>      | 99.00th=[  5342], 99.50th=[  5604], 99.90th=[ 13566], 99.95th=[ 29492],
>      | 99.99th=[164627]
>    bw (  KiB/s): min=53141, max=3875850, per=100.00%, avg=376190.78,
> stdev=577953.67, samples=55
>    iops        : min=   51, max= 3785, avg=366.98, stdev=564.45, samples=55
>   lat (usec)   : 250=11.87%, 500=12.82%, 750=2.62%, 1000=16.95%
>   lat (msec)   : 2=0.41%, 4=13.14%, 10=42.04%, 20=0.07%, 50=0.06%
>   lat (msec)   : 250=0.01%, 500=0.01%
>   cpu          : usr=0.36%, sys=5.34%, ctx=100156, majf=0, minf=0
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,10240,0,1 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   WRITE: bw=158MiB/s (166MB/s), 158MiB/s-158MiB/s (166MB/s-166MB/s),
> io=10.0GiB (10.7GB), run=64645-64645msec

-- 
Sitsofe | http://sucs.org/~sits/



* Re: Disk slows down while emptying write buffer
  2019-10-02 20:12         ` Sitsofe Wheeler
@ 2019-10-02 23:55           ` Elliott Balsley
  0 siblings, 0 replies; 7+ messages in thread
From: Elliott Balsley @ 2019-10-02 23:55 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: fio

OK, thanks. I opened an issue on GitHub.



end of thread

Thread overview: 7+ messages
2019-09-14 16:22 Disk slows down while emptying write buffer Elliott Balsley
2019-09-15  6:42 ` Sitsofe Wheeler
2019-09-15 22:10   ` Elliott Balsley
2019-09-16  6:32     ` Sitsofe Wheeler
2019-10-02  0:40       ` Elliott Balsley
2019-10-02 20:12         ` Sitsofe Wheeler
2019-10-02 23:55           ` Elliott Balsley
