linux-fsdevel.vger.kernel.org archive mirror
* Re: [QUESTION] Long read latencies on mixed rw buffered IO
       [not found] ` <20190325001044.GA23020@dastard>
@ 2019-03-25  7:49   ` Amir Goldstein
  2019-03-25 15:47     ` Darrick J. Wong
  2019-03-25 23:43     ` Dave Chinner
  0 siblings, 2 replies; 16+ messages in thread
From: Amir Goldstein @ 2019-03-25  7:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Christoph Hellwig, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 10799 bytes --]

On Mon, Mar 25, 2019 at 2:10 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Mar 24, 2019 at 08:18:10PM +0200, Amir Goldstein wrote:
> > Hi All,
> >
> > Christoph's re-factoring to xfs_ilock() brought up this question,
> > but AFAICS, current behavior seems to have always been that
> > way for xfs (?).
> >
> > Since commit 6552321831dc ("xfs: remove i_iolock and use
> > i_rwsem in the VFS inode instead"), xfs_file_buffered_aio_read()
> > is the only call site I know of that calls generic_file_read_iter() with
> > i_rwsem read side held.
>
> That did not change the locking behaviour of XFS at all.

I know. Said so above.

>
> > This lock is killing performance of multi-threaded buffered
> > read/write mixed workload on the same file [1].
> > Attached output of bcc tools [2] script xfsdist and ext4dist
> > for latency distribution on the same mixed read/write workload.
>
> In future, can you just paste the text in line, rather than attach
> it as base64 encoded attachments? They are kinda hard to quote.
>
> ext4 read :
>
> ....
>     4 -> 7          : 2610
> ....
>  2048 -> 4095       : 232
>  4096 -> 8191       : 632
>  8192 -> 16383      : 925
> 16384 -> 32767      : 408
> ....
>
> ext4 write:
>
>     4 -> 7          : 16732
>     8 -> 15         : 81712
>    16 -> 31         : 121450
>    32 -> 63         : 29638
>
> So about 4500 read IOPS, of which >50% are cache hits.
> And ~250,000 write iops, all in less than 64 microseconds so
> straight into cache. So we are looking at a ~50:1 write to read
> ratio here.
>
> XFS read:
>
>          0 -> 1          : 8
>          2 -> 3          : 21
>          4 -> 7          : 17
>  ....
>       8192 -> 16383      : 4
>      16384 -> 32767      : 8
>      32768 -> 65535      : 17
>      65536 -> 131071     : 55
>     131072 -> 262143     : 41
>     262144 -> 524287     : 104
>     524288 -> 1048575    : 101
>    1048576 -> 2097151    : 34
>  ....
>
> XFS Write:
>          4 -> 7          : 10087
>          8 -> 15         : 34815
>         16 -> 31         : 32446
>         32 -> 63         : 3888
>         64 -> 127        : 155
>  ....
>         <snip long tail>
>
> Which shows ~400 read ops (almost no cache hits)

It's not because of almost no cache hits.
The pattern is the same as in the ext4 test.
It's because most reads are blocked on writes in the slow path,
as the following sample stack trace from the offwaketime bcc tool shows:

Tracing blocked time (us) by user + kernel off-CPU and waker stack...

    waker:           filebench 32155
    entry_SYSCALL_64_after_hwframe
    do_syscall_64
    sys_pwrite64
    vfs_write
    __vfs_write
    new_sync_write
    xfs_file_write_iter
    xfs_file_buffered_aio_write
    xfs_iunlock
    up_write
    call_rwsem_wake
    rwsem_wake
    --               --
    finish_task_switch
    schedule
    rwsem_down_read_failed
    call_rwsem_down_read_failed
    down_read
    xfs_ilock
    xfs_file_buffered_aio_read
    xfs_file_read_iter
    new_sync_read
    __vfs_read
    vfs_read
    sys_pread64
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    pread
    target:          filebench 32147
        109242

> and ~80,000 write
> ops with a long latency tail. It's roughly a 200:1 write to read
> ratio.
>
> I'm guessing what we are seeing here is a rwsem starvation problem.
> The only reason a read would get delayed for more than a few hundred
> microseconds is if there is a continual stream of write holders
> starving pending reads.
>

Yes. That seems to be the case.

> I don't recall it being this bad historically but rwsems have been
> severely performance optimised for various corner cases over the
> past few years and it is extremely likely that behaviour has
> changed.  Definitely needs looking at, but rwsems are so fragile
> these days (highly dependent on memory ordering that nobody seems to
> be able to get right) and impossible to validate as correct that we
> might just end up having to live with it.
>
> FWIW, What kernel did you test this on?

The results above are from kernel 4.20.7-1.el7.elrepo.x86_64.
Your assumption about the change in rwsem behavior over the
years seems to be correct.
The original report we got from a customer was with kernel 4.9.134.
In that report, long latencies were on writes rather than on reads.
The following output of the filebench randomrw workload summarizes the results:

rand-write1          1128ops       19ops/s   0.1mb/s    340.3ms/op
[0.03ms - 7933.27ms]
rand-read1           2954684ops    49238ops/s 384.7mb/s      0.2ms/op
[0.02ms - 5873.27ms]
61.026: IO Summary: 2955812 ops 49257.253 ops/s 49238/19 rd/wr
384.8mb/s   0.3ms/op

With a more recent kernel (4.20), the latencies have shifted to the read ops:

rand-write1          467706ops     7794ops/s  60.9mb/s    0.733ms/op
[0.004ms - 3899.167ms]
rand-read1           1970ops       33ops/s   0.3mb/s  164.277ms/op
[0.017ms - 3905.929ms]
61.032: IO Summary: 469676 ops 7826.760 ops/s 33/7794 rd/wr  61.1mb/s 1.419ms/op

In a workload of 100% reads or 100% writes, there are no extreme
latencies on either the old or the new kernel.

>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>

Granted, I wasn't reporting a problem in the traditional sense of "what is
the cause?" - that part I know. I was asking why this difference exists
between xfs and other filesystems.
I gathered a lot of traces and analysis from different tools and on
different kernels during the investigation, so pouring them all out at once
would have been counterproductive. I'll happily provide any required
analysis on demand.

> > Compared to ext4, avg. read latency on RAID of spindles
> > can be two orders of magnitude higher (>100ms).
> > I can provide more performance numbers if needed with fio,
> > but they won't surprise anyone considering the extra lock.
> >
> > This workload simulates a performance issue we are seeing
> > on deployed systems. There are other ways for us to work
> > around the issue, not using xfs on those systems would be
> > one way, but I wanted to figure out the reason for this
> > behavior first.
>
> The workload is 8 threads doing random 8k random reads and
> 8 threads doing random writes to a single 5GB file?

Yes. To a preallocated file.

>
> I'd be interested in seeing if a fio workload using 16 randrw
> threads demonstrates the same latency profile (i.e. threads aren't
> exclusively read- or write-only), or is the profile coming about
> specifically because you have dedicated write-only threads that will
> continually bash on the IO locks and hence starve readers?

I guess it depends on the rw mix. With a 50% r/w mix run on kernel
4.20.7-1.el7, long latencies are mostly on the write ops.

Gist of the fio output inline and full output attached (sorry to the
folks who don't like scrolling down):

---XFS---
fio -filename=/xfsfolder/largefile1/00000001/00000001 -direct=0
-iodepth 1 -thread -rw=randrw -rwmixread=50 -ioengine=psync -bs=8k
-size=5G -numjobs=16 -runtime=60 -group_reporting -name=randrw_16th_8k
...
   read: IOPS=264, BW=2117KiB/s (2168kB/s)(124MiB/60045msec)
    clat (usec): min=148, max=491768, avg=39432.35, stdev=23039.80
...
  write: IOPS=268, BW=2149KiB/s (2201kB/s)(126MiB/60045msec)
    clat (usec): min=4, max=477942, avg=20673.33, stdev=26613.06

---EXT4---
fio -filename=/ext4folder/largefile1/00000001/00000001 -direct=0
-iodepth 1 -thread -rw=randrw -rwmixread=50 -ioengine=psync -bs=8k
-size=5G -numjobs=16 -runtime=60 -group_reporting -name=randrw_16th_8k
...
   read: IOPS=1409, BW=11.0MiB/s (11.5MB/s)(662MiB/60100msec)
    clat (usec): min=3, max=1488.6k, avg=11313.66, stdev=22312.95
...
  write: IOPS=1418, BW=11.1MiB/s (11.6MB/s)(666MiB/60100msec)
    clat (usec): min=4, max=7202, avg=19.68, stdev=40.91

>
> > My question is, is the purpose of this lock syncing
> > dio/buffered io?
>
> That's one part of it. The other is POSIX atomic write semantics.
>
> https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html
>
> "I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> Atomic means that all the bytes from a single operation that started
> out together end up together, without interleaving from other I/O
> operations."
>
> i.e. that independent read()s should see a write() as a single
> atomic change. hence if you do a read() concurrently with a write(),
> the read should either run to completion before the write, or the
> write run to completion before the read().
>
> XFS is the only linux filesystem that provides this behaviour.
>
> ext4 and other filesystems that have no buffered read side locking
> provide serialisation at page level, which means an 8kB read racing
> with a 8kB write can return 4k from the new write and 4k from the
> yet-to-be-written part of the file. IOWs, the buffered read gets
> torn.
>

I see... Are you aware of any applications that depend on this POSIX
atomic write behavior? It would seem very unlikely that such applications
exist, if only xfs provides this guarantee on Linux...
IMO, it would be best if applications that desire atomic write semantics
would opt for it using a preadv2 flag. Or, for backward compat, we can allow
applications to opt out of atomic write semantics. [CC fsdevel]
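
For anyone who hasn't seen a torn buffered read, here is a minimal
user-space sketch of the effect Dave describes (hypothetical path,
illustration only): one thread keeps rewriting the same 8k block with a
single byte value, the other reads it back and checks that it never sees
a mix of two writes.

/*
 * Torn buffered read illustration -- sketch only, hypothetical file path.
 * The writer fills the whole 8k with one byte value per write; if a read
 * can return a mix of two writes, the first and last bytes will differ.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define IOSIZE	8192

static int fd;

static void *writer(void *arg)
{
	char buf[IOSIZE];
	int i;

	for (i = 0; ; i++) {
		memset(buf, 'A' + (i & 1), sizeof(buf));
		pwrite(fd, buf, sizeof(buf), 0);
	}
	return NULL;
}

int main(void)
{
	char buf[IOSIZE];
	pthread_t t;
	long i;

	fd = open("/tmp/atomic-read-test", O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return 1;
	memset(buf, 'A', sizeof(buf));
	pwrite(fd, buf, sizeof(buf), 0);	/* size the file first */
	pthread_create(&t, NULL, writer, NULL);

	for (i = 0; i < 10000000; i++) {
		pread(fd, buf, sizeof(buf), 0);
		/* A POSIX-atomic read never returns bytes from two writes. */
		if (buf[0] != buf[IOSIZE - 1]) {
			fprintf(stderr, "torn read: %c vs %c\n",
				buf[0], buf[IOSIZE - 1]);
			return 2;
		}
	}
	return 0;
}

On a filesystem that only serialises buffered IO per page this check can
eventually trip; with the shared iolock that xfs takes, it should not.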

> > If so, was making this behavior optional via mount option
> > ever considered for xfs?
> > Am I the first one who is asking about this specific workload
> > on xfs (Couldn't find anything on Google), or is this a known
> > issue/trade off/design choice of xfs?
>
> It's an original design feature of XFS. See this 1996 usenix paper
> on XFS, section 6.2 - Performing File I/O , "Using Multiple
> Processes":
>
> "Currently, when using direct I/O and multiple writers, we
> place the burden of serializing writes to the same
> region of the file on the application. This differs from
> traditional Unix file I/O where file writes are atomic
> with respect to other file accesses, and it is one of the
> main reasons why we do not yet support multiple
> writers using traditional Unix file I/O."
>
> i.e. XFS was designed with the intent that buffered writes are
> atomic w.r.t. all other file accesses. That's not to say we can't
> change it, just that it has always been different to what linux
> native filesystems do. And depending on which set of application
> developers you talk to, you'll get different answers as to whether
> they want write()s to be atomic....
>

All right. I am reading the above as no prior objection to adding
an XFS mount option and/or preadv2() flag to opt out of this
behavior and align with the reckless behavior of the other local
filesystems on Linux.
Please correct me if I am wrong. I would like to hear what
other people think w.r.t. a mount option vs. a preadv2() flag.
It could also be an open flag, but I prefer not to go down that road...

Thanks,
Amir.

[-- Attachment #2: fio.xfs.randrw_16th_8k.txt --]
[-- Type: text/plain, Size: 2687 bytes --]

randrw_16th_8k: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=psync, iodepth=1
...
fio-3.1
Starting 16 threads

randrw_16th_8k: (groupid=0, jobs=16): err= 0: pid=5320: Mon Mar 25 03:21:00 2019
   read: IOPS=264, BW=2117KiB/s (2168kB/s)(124MiB/60045msec)
    clat (usec): min=148, max=491768, avg=39432.35, stdev=23039.80
     lat (usec): min=148, max=491768, avg=39432.76, stdev=23039.81
    clat percentiles (msec):
     |  1.00th=[   12],  5.00th=[   18], 10.00th=[   21], 20.00th=[   26],
     | 30.00th=[   29], 40.00th=[   32], 50.00th=[   35], 60.00th=[   39],
     | 70.00th=[   43], 80.00th=[   50], 90.00th=[   64], 95.00th=[   80],
     | 99.00th=[  116], 99.50th=[  133], 99.90th=[  247], 99.95th=[  439],
     | 99.99th=[  485]
   bw (  KiB/s): min=   16, max=  272, per=6.27%, avg=132.82, stdev=48.78, samples=1913
   iops        : min=    2, max=   34, avg=16.60, stdev= 6.10, samples=1913
  write: IOPS=268, BW=2149KiB/s (2201kB/s)(126MiB/60045msec)
    clat (usec): min=4, max=477942, avg=20673.33, stdev=26613.06
     lat (usec): min=4, max=477952, avg=20673.98, stdev=26613.15
    clat percentiles (usec):
     |  1.00th=[     7],  5.00th=[     8], 10.00th=[     8], 20.00th=[    12],
     | 30.00th=[    22], 40.00th=[    31], 50.00th=[ 17957], 60.00th=[ 26346],
     | 70.00th=[ 31851], 80.00th=[ 38011], 90.00th=[ 50594], 95.00th=[ 65799],
     | 99.00th=[103285], 99.50th=[115868], 99.90th=[196084], 99.95th=[270533],
     | 99.99th=[476054]
   bw (  KiB/s): min=   16, max=  384, per=6.27%, avg=134.82, stdev=66.25, samples=1914
   iops        : min=    2, max=   48, avg=16.85, stdev= 8.28, samples=1914
  lat (usec)   : 10=8.31%, 20=6.12%, 50=9.41%, 100=0.68%, 250=0.02%
  lat (usec)   : 500=0.01%, 750=0.01%
  lat (msec)   : 2=0.01%, 4=0.07%, 10=0.27%, 20=5.26%, 50=55.06%
  lat (msec)   : 100=13.19%, 250=1.52%, 500=0.07%
  cpu          : usr=0.03%, sys=0.25%, ctx=41774, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=15893,16131,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=2117KiB/s (2168kB/s), 2117KiB/s-2117KiB/s (2168kB/s-2168kB/s), io=124MiB (130MB), run=60045-60045msec
  WRITE: bw=2149KiB/s (2201kB/s), 2149KiB/s-2149KiB/s (2201kB/s-2201kB/s), io=126MiB (132MB), run=60045-60045msec

Disk stats (read/write):
  sdc: ios=15435/10391, merge=0/1, ticks=122505/4115, in_queue=126193, util=98.26%

[-- Attachment #3: fio.ext4.randrw_16th_8k.txt --]
[-- Type: text/plain, Size: 2750 bytes --]

randrw_16th_8k: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=psync, iodepth=1
...
fio-3.1
Starting 16 threads

randrw_16th_8k: (groupid=0, jobs=16): err= 0: pid=5294: Mon Mar 25 03:15:49 2019
   read: IOPS=1409, BW=11.0MiB/s (11.5MB/s)(662MiB/60100msec)
    clat (usec): min=3, max=1488.6k, avg=11313.66, stdev=22312.95
     lat (usec): min=3, max=1488.6k, avg=11313.90, stdev=22312.95
    clat percentiles (usec):
     |  1.00th=[     5],  5.00th=[     6], 10.00th=[     8], 20.00th=[  3392],
     | 30.00th=[  5145], 40.00th=[  6718], 50.00th=[  8094], 60.00th=[  9503],
     | 70.00th=[ 11338], 80.00th=[ 14484], 90.00th=[ 21103], 95.00th=[ 29754],
     | 99.00th=[ 68682], 99.50th=[109577], 99.90th=[278922], 99.95th=[379585],
     | 99.99th=[893387]
   bw (  KiB/s): min=   32, max= 1202, per=6.30%, avg=710.12, stdev=188.47, samples=1909
   iops        : min=    4, max=  150, avg=88.72, stdev=23.54, samples=1909
  write: IOPS=1418, BW=11.1MiB/s (11.6MB/s)(666MiB/60100msec)
    clat (usec): min=4, max=7202, avg=19.68, stdev=40.91
     lat (usec): min=5, max=7203, avg=20.09, stdev=40.97
    clat percentiles (usec):
     |  1.00th=[    7],  5.00th=[   10], 10.00th=[   12], 20.00th=[   12],
     | 30.00th=[   13], 40.00th=[   15], 50.00th=[   18], 60.00th=[   20],
     | 70.00th=[   22], 80.00th=[   26], 90.00th=[   32], 95.00th=[   37],
     | 99.00th=[   56], 99.50th=[   64], 99.90th=[   91], 99.95th=[  117],
     | 99.99th=[  379]
   bw (  KiB/s): min=   16, max= 1523, per=6.30%, avg=715.24, stdev=247.64, samples=1908
   iops        : min=    2, max=  190, avg=89.36, stdev=30.93, samples=1908
  lat (usec)   : 4=0.02%, 10=8.31%, 20=29.16%, 50=17.64%, 100=0.69%
  lat (usec)   : 250=0.04%, 500=0.66%, 750=0.08%, 1000=0.02%
  lat (msec)   : 2=0.55%, 4=4.52%, 10=20.33%, 20=12.50%, 50=4.56%
  lat (msec)   : 100=0.63%, 250=0.21%, 500=0.05%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.09%, sys=0.48%, ctx=75050, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=84702,85250,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=11.0MiB/s (11.5MB/s), 11.0MiB/s-11.0MiB/s (11.5MB/s-11.5MB/s), io=662MiB (694MB), run=60100-60100msec
  WRITE: bw=11.1MiB/s (11.6MB/s), 11.1MiB/s-11.1MiB/s (11.6MB/s-11.6MB/s), io=666MiB (698MB), run=60100-60100msec

Disk stats (read/write):
  sdb: ios=74911/60641, merge=0/13, ticks=950347/34711, in_queue=987372, util=100.00%

* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25  7:49   ` [QUESTION] Long read latencies on mixed rw buffered IO Amir Goldstein
@ 2019-03-25 15:47     ` Darrick J. Wong
  2019-03-25 16:41       ` Matthew Wilcox
  2019-03-25 17:56       ` Amir Goldstein
  2019-03-25 23:43     ` Dave Chinner
  1 sibling, 2 replies; 16+ messages in thread
From: Darrick J. Wong @ 2019-03-25 15:47 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Dave Chinner, linux-xfs, Christoph Hellwig, linux-fsdevel

On Mon, Mar 25, 2019 at 09:49:33AM +0200, Amir Goldstein wrote:
> On Mon, Mar 25, 2019 at 2:10 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sun, Mar 24, 2019 at 08:18:10PM +0200, Amir Goldstein wrote:
> > > Hi All,
> > >
> > > Christoph's re-factoring to xfs_ilock() brought up this question,
> > > but AFAICS, current behavior seems to have always been that
> > > way for xfs (?).
> > >
> > > Since commit 6552321831dc ("xfs: remove i_iolock and use
> > > i_rwsem in the VFS inode instead"), xfs_file_buffered_aio_read()
> > > is the only call site I know of that calls generic_file_read_iter() with
> > > i_rwsem read side held.
> >
> > That did not change the locking behaviour of XFS at all.
> 
> I know. Said so above.
> 
> >
> > > This lock is killing performance of multi-threaded buffered
> > > read/write mixed workload on the same file [1].
> > > Attached output of bcc tools [2] script xfsdist and ext4dist
> > > for latency distribution on the same mixed read/write workload.
> >
> > In future, can you just paste the text in line, rather than attach
> > it as base64 encoded attachments? They are kinda hard to quote.
> >
> > ext4 read :
> >
> > ....
> >     4 -> 7          : 2610
> > ....
> >  2048 -> 4095       : 232
> >  4096 -> 8191       : 632
> >  8192 -> 16383      : 925
> > 16384 -> 32767      : 408
> > ....
> >
> > ext4 write:
> >
> >     4 -> 7          : 16732
> >     8 -> 15         : 81712
> >    16 -> 31         : 121450
> >    32 -> 63         : 29638
> >
> > So about 4500 read IOPS, of which >50% are cache hits.
> > And ~250,000 write iops, all in less than 64 microseconds so
> > straight into cache. So we are looking at a ~50:1 write to read
> > ratio here.
> >
> > XFS read:
> >
> >          0 -> 1          : 8
> >          2 -> 3          : 21
> >          4 -> 7          : 17
> >  ....
> >       8192 -> 16383      : 4
> >      16384 -> 32767      : 8
> >      32768 -> 65535      : 17
> >      65536 -> 131071     : 55
> >     131072 -> 262143     : 41
> >     262144 -> 524287     : 104
> >     524288 -> 1048575    : 101
> >    1048576 -> 2097151    : 34
> >  ....
> >
> > XFS Write:
> >          4 -> 7          : 10087
> >          8 -> 15         : 34815
> >         16 -> 31         : 32446
> >         32 -> 63         : 3888
> >         64 -> 127        : 155
> >  ....
> >         <snip long tail>
> >
> > Which shows ~400 read ops (almost no cache hits)
> 
> It's not because of almost no cache hits.
> The pattern is the same as in the ext4 test.
> It's because most reads are blocked on writes in the slow path,
> as the following sample stack trace from the offwaketime bcc tool shows:
> 
> Tracing blocked time (us) by user + kernel off-CPU and waker stack...
> 
>     waker:           filebench 32155
>     entry_SYSCALL_64_after_hwframe
>     do_syscall_64
>     sys_pwrite64
>     vfs_write
>     __vfs_write
>     new_sync_write
>     xfs_file_write_iter
>     xfs_file_buffered_aio_write
>     xfs_iunlock
>     up_write
>     call_rwsem_wake
>     rwsem_wake
>     --               --
>     finish_task_switch
>     schedule
>     rwsem_down_read_failed
>     call_rwsem_down_read_failed
>     down_read
>     xfs_ilock
>     xfs_file_buffered_aio_read
>     xfs_file_read_iter
>     new_sync_read
>     __vfs_read
>     vfs_read
>     sys_pread64
>     do_syscall_64
>     entry_SYSCALL_64_after_hwframe
>     pread
>     target:          filebench 32147
>         109242
> 
> > and ~80,000 write
> > ops with a long latency tail. It's roughly a 200:1 write to read
> > ratio.
> >
> > I'm guessing what we are seeing here is a rwsem starvation problem.
> > The only reason a read would get delayed for more than a few hundred
> > microseconds is if there is a continual stream of write holders
> > starving pending reads.
> >
> 
> Yes. That seems to be the case.
> 
> > I don't recall it being this bad historically but rwsems have been
> > severely performance optimised for various corner cases over the
> > past few years and it is extremely likely that behaviour has
> > changed.  Definitely needs looking at, but rwsems are so fragile
> > these days (highly dependent on memory ordering that nobody seems to
> > be able to get right) and impossible to validate as correct that we
> > might just end up having to live with it.
> >
> > FWIW, What kernel did you test this on?
> 
> The results above are from kernel 4.20.7-1.el7.elrepo.x86_64.
> Your assumption about the change in rwsem behavior over the
> years seems to be correct.
> The original report we got from a customer was with kernel 4.9.134.
> In that report, long latencies were on writes rather than on reads.
> The following output of the filebench randomrw workload summarizes the results:
> 
> rand-write1          1128ops       19ops/s   0.1mb/s    340.3ms/op
> [0.03ms - 7933.27ms]
> rand-read1           2954684ops    49238ops/s 384.7mb/s      0.2ms/op
> [0.02ms - 5873.27ms]
> 61.026: IO Summary: 2955812 ops 49257.253 ops/s 49238/19 rd/wr
> 384.8mb/s   0.3ms/op
> 
> With a more recent kernel (4.20), the latencies have shifted to the read ops:
> 
> rand-write1          467706ops     7794ops/s  60.9mb/s    0.733ms/op
> [0.004ms - 3899.167ms]
> rand-read1           1970ops       33ops/s   0.3mb/s  164.277ms/op
> [0.017ms - 3905.929ms]
> 61.032: IO Summary: 469676 ops 7826.760 ops/s 33/7794 rd/wr  61.1mb/s 1.419ms/op
> 
> In a workload of 100% reads or 100% writes, there are no extreme
> latencies on either the old or the new kernel.

Hmmm.... so it looks like the rw_semaphore behavior has shifted over
time, then?

I thought rwsem was supposed to queue read and write waiters in order,
at least on x86?  Though I suppose that might not matter much since we
can only run one writer at a time vs. waking up all the readers at once.
Now I'm wondering if there ever was a time when the readers all got
batched to the front and starved the writers, but eh I haven't drank
enough coffee to remember things like that. :P

(I wonder what would happen if rw_semaphore decided to wake up some
number of the readers in the rwsem wait_list, not just the ones at the
front...)

> >
> > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> >
> 
> Granted, I wasn't reporting a problem in the traditional sense of "what is
> the cause?" - that part I know. I was asking why this difference exists
> between xfs and other filesystems.
> I gathered a lot of traces and analysis from different tools and on
> different kernels during the investigation, so pouring them all out at once
> would have been counterproductive. I'll happily provide any required
> analysis on demand.
> 
> > > Compared to ext4, avg. read latency on RAID of spindles
> > > can be two orders of magnitude higher (>100ms).
> > > I can provide more performance numbers if needed with fio,
> > > but they won't surprise anyone considering the extra lock.
> > >
> > > This workload simulates a performance issue we are seeing
> > > on deployed systems. There are other ways for us to work
> > > around the issue, not using xfs on those systems would be
> > > one way, but I wanted to figure out the reason for this
> > > behavior first.
> >
> > The workload is 8 threads doing random 8k random reads and
> > 8 threads doing random writes to a single 5GB file?
> 
> Yes. To a preallocated file.
> 
> >
> > I'd be interested in seeing if a fio workload using 16 randrw
> > threads demonstrates the same latency profile (i.e. threads aren't
> > exclusively read- or write-only), or is the profile coming about
> > specifically because you have dedicated write-only threads that will
> > continually bash on the IO locks and hence starve readers?
> 
> I guess it depends on the rw mix. With a 50% r/w mix run on kernel
> 4.20.7-1.el7, long latencies are mostly on the write ops.
> 
> Gist of the fio output inline and full output attached (sorry to the
> folks who don't like scrolling down):
> 
> ---XFS---
> fio -filename=/xfsfolder/largefile1/00000001/00000001 -direct=0
> -iodepth 1 -thread -rw=randrw -rwmixread=50 -ioengine=psync -bs=8k
> -size=5G -numjobs=16 -runtime=60 -group_reporting -name=randrw_16th_8k
> ...
>    read: IOPS=264, BW=2117KiB/s (2168kB/s)(124MiB/60045msec)
>     clat (usec): min=148, max=491768, avg=39432.35, stdev=23039.80
> ...
>   write: IOPS=268, BW=2149KiB/s (2201kB/s)(126MiB/60045msec)
>     clat (usec): min=4, max=477942, avg=20673.33, stdev=26613.06
> 
> ---EXT4---
> fio -filename=/ext4folder/largefile1/00000001/00000001 -direct=0
> -iodepth 1 -thread -rw=randrw -rwmixread=50 -ioengine=psync -bs=8k
> -size=5G -numjobs=16 -runtime=60 -group_reporting -name=randrw_16th_8k
> ...
>    read: IOPS=1409, BW=11.0MiB/s (11.5MB/s)(662MiB/60100msec)
>     clat (usec): min=3, max=1488.6k, avg=11313.66, stdev=22312.95
> ...
>   write: IOPS=1418, BW=11.1MiB/s (11.6MB/s)(666MiB/60100msec)
>     clat (usec): min=4, max=7202, avg=19.68, stdev=40.91
> 
> >
> > > My question is, is the purpose of this lock syncing
> > > dio/buffered io?
> >
> > That's one part of it. The other is POSIX atomic write semantics.
> >
> > https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html
> >
> > "I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > Atomic means that all the bytes from a single operation that started
> > out together end up together, without interleaving from other I/O
> > operations."
> >
> > i.e. that independent read()s should see a write() as a single
> > atomic change. hence if you do a read() concurrently with a write(),
> > the read should either run to completion before the write, or the
> > write run to completion before the read().
> >
> > XFS is the only linux filesystem that provides this behaviour.
> >
> > ext4 and other filesystems that have no buffered read side locking
> > provide serialisation at page level, which means an 8kB read racing
> > with a 8kB write can return 4k from the new write and 4k from the
> > yet-to-be-written part of the file. IOWs, the buffered read gets
> > torn.
> >
> 
> I see... Are you aware of any applications that depend on this POSIX
> atomic write behavior? It would seem very unlikely that such applications
> exist, if only xfs provides this guarantee on Linux...

If we remove the locks that establish the atomic io call behavior, it
seems likely to me that we'll find out which applications actually did
require the behavior and recommended XFS because of its unique
abilities.  Only now they'll be mad.

> IMO, it would be best if applications that desire atomic write semantics
> would opt for it using a preadv2 flag. Or, for backward compat, we can allow
> applications to opt out of atomic write semantics. [CC fsdevel]
> 
> > > If so, was making this behavior optional via mount option
> > > ever considered for xfs?
> > > Am I the first one who is asking about this specific workload
> > > on xfs (Couldn't find anything on Google), or is this a known
> > > issue/trade off/design choice of xfs?
> >
> > It's an original design feature of XFS. See this 1996 usenix paper
> > on XFS, section 6.2 - Performing File I/O , "Using Multiple
> > Processes":
> >
> > "Currently, when using direct I/O and multiple writers, we
> > place the burden of serializing writes to the same
> > region of the file on the application. This differs from
> > traditional Unix file I/O where file writes are atomic
> > with respect to other file accesses, and it is one of the
> > main reasons why we do not yet support multiple
> > writers using traditional Unix file I/O."
> >
> > i.e. XFS was designed with the intent that buffered writes are
> > > atomic w.r.t. all other file accesses. That's not to say we can't
> > change it, just that it has always been different to what linux
> > native filesystems do. And depending on which set of application
> > developers you talk to, you'll get different answers as to whether
> > they want write()s to be atomic....
> >
> 
> All right. I am reading the above as no prior objection to adding
> an XFS mount option and/or preadv2() flag to opt out of this
> behavior

Reads and writes are not the only thing xfs uses i_rwsem to synchronise.
Reflink remap uses it to make sure everything's flushed to disk and that
page cache contents remain clean while the remap is ongoing.  I'm pretty
sure pnfs uses it for similar reasons when granting and committing write
leases.

> and align with the reckless behavior of the other local
> filesystems on Linux.
> Please correct me if I am wrong. I would like to hear what
> other people think w.r.t. a mount option vs. a preadv2() flag.
> It could also be an open flag, but I prefer not to go down that road...

I don't like the idea of adding an O_BROKENLOCKINGPONIES flag because it
adds more QA work to support a new (for xfs) violation of the posix spec
we (supposedly) fulfill.

If rw_semaphore favors writes too heavily, why not fix that?

<mutter mutter mutter>

--D

> Thanks,
> Amir.

> randrw_16th_8k: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=psync, iodepth=1
> ...
> fio-3.1
> Starting 16 threads
> 
> randrw_16th_8k: (groupid=0, jobs=16): err= 0: pid=5320: Mon Mar 25 03:21:00 2019
>    read: IOPS=264, BW=2117KiB/s (2168kB/s)(124MiB/60045msec)
>     clat (usec): min=148, max=491768, avg=39432.35, stdev=23039.80
>      lat (usec): min=148, max=491768, avg=39432.76, stdev=23039.81
>     clat percentiles (msec):
>      |  1.00th=[   12],  5.00th=[   18], 10.00th=[   21], 20.00th=[   26],
>      | 30.00th=[   29], 40.00th=[   32], 50.00th=[   35], 60.00th=[   39],
>      | 70.00th=[   43], 80.00th=[   50], 90.00th=[   64], 95.00th=[   80],
>      | 99.00th=[  116], 99.50th=[  133], 99.90th=[  247], 99.95th=[  439],
>      | 99.99th=[  485]
>    bw (  KiB/s): min=   16, max=  272, per=6.27%, avg=132.82, stdev=48.78, samples=1913
>    iops        : min=    2, max=   34, avg=16.60, stdev= 6.10, samples=1913
>   write: IOPS=268, BW=2149KiB/s (2201kB/s)(126MiB/60045msec)
>     clat (usec): min=4, max=477942, avg=20673.33, stdev=26613.06
>      lat (usec): min=4, max=477952, avg=20673.98, stdev=26613.15
>     clat percentiles (usec):
>      |  1.00th=[     7],  5.00th=[     8], 10.00th=[     8], 20.00th=[    12],
>      | 30.00th=[    22], 40.00th=[    31], 50.00th=[ 17957], 60.00th=[ 26346],
>      | 70.00th=[ 31851], 80.00th=[ 38011], 90.00th=[ 50594], 95.00th=[ 65799],
>      | 99.00th=[103285], 99.50th=[115868], 99.90th=[196084], 99.95th=[270533],
>      | 99.99th=[476054]
>    bw (  KiB/s): min=   16, max=  384, per=6.27%, avg=134.82, stdev=66.25, samples=1914
>    iops        : min=    2, max=   48, avg=16.85, stdev= 8.28, samples=1914
>   lat (usec)   : 10=8.31%, 20=6.12%, 50=9.41%, 100=0.68%, 250=0.02%
>   lat (usec)   : 500=0.01%, 750=0.01%
>   lat (msec)   : 2=0.01%, 4=0.07%, 10=0.27%, 20=5.26%, 50=55.06%
>   lat (msec)   : 100=13.19%, 250=1.52%, 500=0.07%
>   cpu          : usr=0.03%, sys=0.25%, ctx=41774, majf=0, minf=0
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwt: total=15893,16131,0, short=0,0,0, dropped=0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    READ: bw=2117KiB/s (2168kB/s), 2117KiB/s-2117KiB/s (2168kB/s-2168kB/s), io=124MiB (130MB), run=60045-60045msec
>   WRITE: bw=2149KiB/s (2201kB/s), 2149KiB/s-2149KiB/s (2201kB/s-2201kB/s), io=126MiB (132MB), run=60045-60045msec
> 
> Disk stats (read/write):
>   sdc: ios=15435/10391, merge=0/1, ticks=122505/4115, in_queue=126193, util=98.26%

> randrw_16th_8k: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=psync, iodepth=1
> ...
> fio-3.1
> Starting 16 threads
> 
> randrw_16th_8k: (groupid=0, jobs=16): err= 0: pid=5294: Mon Mar 25 03:15:49 2019
>    read: IOPS=1409, BW=11.0MiB/s (11.5MB/s)(662MiB/60100msec)
>     clat (usec): min=3, max=1488.6k, avg=11313.66, stdev=22312.95
>      lat (usec): min=3, max=1488.6k, avg=11313.90, stdev=22312.95
>     clat percentiles (usec):
>      |  1.00th=[     5],  5.00th=[     6], 10.00th=[     8], 20.00th=[  3392],
>      | 30.00th=[  5145], 40.00th=[  6718], 50.00th=[  8094], 60.00th=[  9503],
>      | 70.00th=[ 11338], 80.00th=[ 14484], 90.00th=[ 21103], 95.00th=[ 29754],
>      | 99.00th=[ 68682], 99.50th=[109577], 99.90th=[278922], 99.95th=[379585],
>      | 99.99th=[893387]
>    bw (  KiB/s): min=   32, max= 1202, per=6.30%, avg=710.12, stdev=188.47, samples=1909
>    iops        : min=    4, max=  150, avg=88.72, stdev=23.54, samples=1909
>   write: IOPS=1418, BW=11.1MiB/s (11.6MB/s)(666MiB/60100msec)
>     clat (usec): min=4, max=7202, avg=19.68, stdev=40.91
>      lat (usec): min=5, max=7203, avg=20.09, stdev=40.97
>     clat percentiles (usec):
>      |  1.00th=[    7],  5.00th=[   10], 10.00th=[   12], 20.00th=[   12],
>      | 30.00th=[   13], 40.00th=[   15], 50.00th=[   18], 60.00th=[   20],
>      | 70.00th=[   22], 80.00th=[   26], 90.00th=[   32], 95.00th=[   37],
>      | 99.00th=[   56], 99.50th=[   64], 99.90th=[   91], 99.95th=[  117],
>      | 99.99th=[  379]
>    bw (  KiB/s): min=   16, max= 1523, per=6.30%, avg=715.24, stdev=247.64, samples=1908
>    iops        : min=    2, max=  190, avg=89.36, stdev=30.93, samples=1908
>   lat (usec)   : 4=0.02%, 10=8.31%, 20=29.16%, 50=17.64%, 100=0.69%
>   lat (usec)   : 250=0.04%, 500=0.66%, 750=0.08%, 1000=0.02%
>   lat (msec)   : 2=0.55%, 4=4.52%, 10=20.33%, 20=12.50%, 50=4.56%
>   lat (msec)   : 100=0.63%, 250=0.21%, 500=0.05%, 750=0.01%, 1000=0.01%
>   lat (msec)   : 2000=0.01%
>   cpu          : usr=0.09%, sys=0.48%, ctx=75050, majf=0, minf=0
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwt: total=84702,85250,0, short=0,0,0, dropped=0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    READ: bw=11.0MiB/s (11.5MB/s), 11.0MiB/s-11.0MiB/s (11.5MB/s-11.5MB/s), io=662MiB (694MB), run=60100-60100msec
>   WRITE: bw=11.1MiB/s (11.6MB/s), 11.1MiB/s-11.1MiB/s (11.6MB/s-11.6MB/s), io=666MiB (698MB), run=60100-60100msec
> 
> Disk stats (read/write):
>   sdb: ios=74911/60641, merge=0/13, ticks=950347/34711, in_queue=987372, util=100.00%


* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 15:47     ` Darrick J. Wong
@ 2019-03-25 16:41       ` Matthew Wilcox
  2019-03-25 17:30         ` Amir Goldstein
  2019-03-25 17:56       ` Amir Goldstein
  1 sibling, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2019-03-25 16:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, Dave Chinner, linux-xfs, Christoph Hellwig,
	linux-fsdevel

On Mon, Mar 25, 2019 at 08:47:31AM -0700, Darrick J. Wong wrote:
> Hmmm.... so it looks like the rw_semaphore behavior has shifted over
> time, then?

Yes.

> I thought rwsem was supposed to queue read and write waiters in order,
> at least on x86?  Though I suppose that might not matter much since we
> can only run one writer at a time vs. waking up all the readers at once.
> Now I'm wondering if there ever was a time when the readers all got
> batched to the front and starved the writers, but eh I haven't drank
> enough coffee to remember things like that. :P
> 
> (I wonder what would happen if rw_semaphore decided to wake up some
> number of the readers in the rwsem wait_list, not just the ones at the
> front...)

rwsems currently allow a limited amount of queue-jumping; if a semaphore
is currently not acquired (it's in transition between two owners), a
running process can acquire it.

I think it is a bug that we only wake readers at the front of the queue;
I think we would get better performance if we wake all readers.  ie here:

        /*
-        * Grant an infinite number of read locks to the readers at the front
-        * of the queue. We know that woken will be at least 1 as we accounted
+	 * Grant an infinite number of read locks.  We know that woken will
+	 * be at least 1 as we accounted  
         * for above. Note we increment the 'active part' of the count by the
         * number of readers before waking any processes up.
         */
        list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
                struct task_struct *tsk;

-               if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-                       break;

Amir, it seems like you have a good test-case for trying this out ...

* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 16:41       ` Matthew Wilcox
@ 2019-03-25 17:30         ` Amir Goldstein
  2019-03-25 18:22           ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Amir Goldstein @ 2019-03-25 17:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Christoph Hellwig,
	linux-fsdevel

On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Mar 25, 2019 at 08:47:31AM -0700, Darrick J. Wong wrote:
> > Hmmm.... so it looks like the rw_semaphore behavior has shifted over
> > time, then?
>
> Yes.
>
> > I thought rwsem was supposed to queue read and write waiters in order,
> > at least on x86?  Though I suppose that might not matter much since we
> > can only run one writer at a time vs. waking up all the readers at once.
> > Now I'm wondering if there ever was a time when the readers all got
> > batched to the front and starved the writers, but eh I haven't drank
> > enough coffee to remember things like that. :P
> >
> > (I wonder what would happen if rw_semaphore decided to wake up some
> > number of the readers in the rwsem wait_list, not just the ones at the
> > front...)
>
> rwsems currently allow a limited amount of queue-jumping; if a semaphore
> is currently not acquired (it's in transition between two owners), a
> running process can acquire it.
>
> I think it is a bug that we only wake readers at the front of the queue;
> I think we would get better performance if we wake all readers.  ie here:
>
>         /*
> -        * Grant an infinite number of read locks to the readers at the front
> -        * of the queue. We know that woken will be at least 1 as we accounted
> +        * Grant an infinite number of read locks.  We know that woken will
> +        * be at least 1 as we accounted
>          * for above. Note we increment the 'active part' of the count by the
>          * number of readers before waking any processes up.
>          */
>         list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
>                 struct task_struct *tsk;
>
> -               if (waiter->type == RWSEM_WAITING_FOR_WRITE)
> -                       break;
>
> Amir, it seems like you have a good test-case for trying this out ...

Sure, but please explain. Why are you waking up the writers?

Thanks,
Amir.

* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 15:47     ` Darrick J. Wong
  2019-03-25 16:41       ` Matthew Wilcox
@ 2019-03-25 17:56       ` Amir Goldstein
  2019-03-25 18:02         ` Christoph Hellwig
  1 sibling, 1 reply; 16+ messages in thread
From: Amir Goldstein @ 2019-03-25 17:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, Christoph Hellwig, linux-fsdevel

[...]

> > > i.e. XFS was designed with the intent that buffered writes are
> > > atomic w.r.t. all other file accesses. That's not to say we can't
> > > change it, just that it has always been different to what linux
> > > native filesystems do. And depending on which set of application
> > > developers you talk to, you'll get different answers as to whether
> > > they want write()s to be atomic....
> > >
> >
> > All right. I am reading the above as no prior objection to adding
> > an XFS mount option and/or preadv2() flag to opt out of this
> > behavior
>
> Reads and writes are not the only thing xfs uses i_rwsem to synchronise.
> Reflink remap uses it to make sure everything's flushed to disk and that
> page cache contents remain clean while the remap is ongoing.  I'm pretty
> sure pnfs uses it for similar reasons when granting and committing write
> leases.
>

OK. If it wasn't clear, I wasn't suggesting removing i_rwsem around
buffered writes. All filesystems must take the exclusive i_rwsem around
write_begin() and write_end(), and for filesystems that call
generic_file_write_iter(), i_rwsem is held throughout the buffered write.

What I said is that no other fs holds a shared i_rwsem around
generic_file_read_iter(), so this shared lock could perhaps be removed
without implications beyond atomic rd/wr and sync with dio.

FWIW, I ran an xfstests -g quick sanity run with 1k and 4k block sizes
without the shared i_rwsem lock around generic_file_read_iter() and it
did not barf.
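
For reference, that experiment amounts to roughly the following shape --
a sketch from memory, not the exact 5.1-rc1 source (details like the
IOCB_NOWAIT handling are omitted):

/*
 * Sketch only -- approximately what xfs_file_buffered_aio_read() looks
 * like with the shared IOLOCK commented out for the experiment above.
 */
STATIC ssize_t
xfs_file_buffered_aio_read(
	struct kiocb		*iocb,
	struct iov_iter		*to)
{
	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
	ssize_t			ret;

	/* xfs_ilock(ip, XFS_IOLOCK_SHARED); */
	ret = generic_file_read_iter(iocb, to);
	/* xfs_iunlock(ip, XFS_IOLOCK_SHARED); */

	return ret;
}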

> > and align with the reckless behavior of the other local
> > filesystems on Linux.
> > Please correct me if I am wrong and I would like to hear what
> > other people thing w.r.t mount option vs. preadv2() flag.
> > It could also be an open flag, but I prefer not to go down that road...
>
> I don't like the idea of adding a O_BROKENLOCKINGPONIES flag because it
> adds more QA work to support a new (for xfs) violation of the posix spec
> we (supposedly) fulfill.
>

Surely, there is no reason for you to like a larger test matrix, but
if it turns out that there is no magic fix, I don't see an alternative.
Perhaps application opt-in via an RWF_NOATOMIC flag would have the least
impact on the test matrix. It makes sense for applications that know
they handle IO concurrency at a higher level, or applications that
know they only need page-size atomicity.
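
Something like this on the application side -- RWF_NOATOMIC and its value
are entirely made up here, just to illustrate the proposed opt-out:

/*
 * Hypothetical opt-out sketch.  RWF_NOATOMIC does not exist; the flag
 * name and value are made up purely for illustration.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_NOATOMIC
#define RWF_NOATOMIC	0x00000080	/* placeholder, NOT a real RWF_* flag */
#endif

int main(void)
{
	char buf[8192];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("/path/to/file", O_RDONLY);
	ssize_t ret;

	if (fd < 0)
		return 1;
	/*
	 * The application declares it does not need read vs. write
	 * atomicity beyond page granularity, so the filesystem could
	 * skip the shared iolock for this read.
	 */
	ret = preadv2(fd, &iov, 1, 0, RWF_NOATOMIC);
	printf("read %zd bytes\n", ret);
	return 0;
}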

> If rw_semaphore favors writes too heavily, why not fix that?
>

Sure, let's give that a shot. But allow me to stay skeptical, because
I don't think there is a one-size-fits-all solution.
If an application doesn't need >4K atomicity and xfs imposes file-wide
read locks, there is bound to be a workload where ext4 can guarantee
lower latencies than xfs.

Then again, if we fix rw_semaphore to do a good enough job for my
workload, I may not care about those worst case workloads...

Thanks,
Amir.

* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 17:56       ` Amir Goldstein
@ 2019-03-25 18:02         ` Christoph Hellwig
  2019-03-25 18:44           ` Amir Goldstein
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2019-03-25 18:02 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Christoph Hellwig,
	linux-fsdevel

On Mon, Mar 25, 2019 at 07:56:33PM +0200, Amir Goldstein wrote:
> Sure, let's give that a shot. But allow me to stay skeptical, because
> I don't think there is a one-size-fits-all solution.
> If an application doesn't need >4K atomicity and xfs imposes file-wide
> read locks, there is bound to be a workload where ext4 can guarantee
> lower latencies than xfs.
> 
> Then again, if we fix rw_semaphore to do a good enough job for my
> workload, I may not care about those worst case workloads...

Downgrading these long standing guarantees is simply not an option.

Not quite sure what the I/O pattern of your workload is, but if it
is reads from regions other than the ones you write to, you should
look into implementing range locks.
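
For illustration only, a toy user-space sketch of the byte-range lock
idea (exclusive-only, nothing like real kernel code): I/O to
non-overlapping ranges proceeds in parallel, overlapping ranges
serialise.

/*
 * Toy byte-range lock -- user-space illustration only.
 */
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

struct range {
	off_t		start, end;	/* [start, end) */
	struct range	*next;
};

static struct range *held;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t unlocked = PTHREAD_COND_INITIALIZER;

static int overlaps(off_t start, off_t end)
{
	struct range *r;

	for (r = held; r; r = r->next)
		if (start < r->end && r->start < end)
			return 1;
	return 0;
}

void range_lock(off_t start, off_t len)
{
	struct range *r = malloc(sizeof(*r));

	if (!r)
		abort();
	r->start = start;
	r->end = start + len;
	pthread_mutex_lock(&lock);
	/* Wait until no currently held range overlaps ours. */
	while (overlaps(r->start, r->end))
		pthread_cond_wait(&unlocked, &lock);
	r->next = held;
	held = r;
	pthread_mutex_unlock(&lock);
}

void range_unlock(off_t start, off_t len)
{
	struct range **p, *r;

	pthread_mutex_lock(&lock);
	for (p = &held; *p; p = &(*p)->next) {
		if ((*p)->start == start && (*p)->end == start + len) {
			r = *p;
			*p = r->next;
			free(r);
			break;
		}
	}
	pthread_cond_broadcast(&unlocked);
	pthread_mutex_unlock(&lock);
}

A real implementation would also need shared vs. exclusive holders and
fairness, but the point stands: a buffered read of one range would no
longer wait behind a write to a different range.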

* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 17:30         ` Amir Goldstein
@ 2019-03-25 18:22           ` Matthew Wilcox
  2019-03-25 19:18             ` Amir Goldstein
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2019-03-25 18:22 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Christoph Hellwig,
	linux-fsdevel

On Mon, Mar 25, 2019 at 07:30:39PM +0200, Amir Goldstein wrote:
> On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Mar 25, 2019 at 08:47:31AM -0700, Darrick J. Wong wrote:
> > > Hmmm.... so it looks like the rw_semaphore behavior has shifted over
> > > time, then?
> >
> > Yes.
> >
> > > I thought rwsem was supposed to queue read and write waiters in order,
> > > at least on x86?  Though I suppose that might not matter much since we
> > > can only run one writer at a time vs. waking up all the readers at once.
> > > Now I'm wondering if there ever was a time when the readers all got
> > > batched to the front and starved the writers, but eh I haven't drank
> > > enough coffee to remember things like that. :P
> > >
> > > (I wonder what would happen if rw_semaphore decided to wake up some
> > > number of the readers in the rwsem wait_list, not just the ones at the
> > > front...)
> >
> > rwsems currently allow a limited amount of queue-jumping; if a semaphore
> > is currently not acquired (it's in transition between two owners), a
> > running process can acquire it.
> >
> > I think it is a bug that we only wake readers at the front of the queue;
> > I think we would get better performance if we wake all readers.  ie here:
> >
> >         /*
> > -        * Grant an infinite number of read locks to the readers at the front
> > -        * of the queue. We know that woken will be at least 1 as we accounted
> > +        * Grant an infinite number of read locks.  We know that woken will
> > +        * be at least 1 as we accounted
> >          * for above. Note we increment the 'active part' of the count by the
> >          * number of readers before waking any processes up.
> >          */
> >         list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
> >                 struct task_struct *tsk;
> >
> > -               if (waiter->type == RWSEM_WAITING_FOR_WRITE)
> > -                       break;
> >
> > Amir, it seems like you have a good test-case for trying this out ...
> 
> Sure, but please explain. Why are you waking up the writers?

Oh ... duh.  I didn't mean to wake up the writers.  I meant ...

               if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-			break;
+			continue;


* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 18:02         ` Christoph Hellwig
@ 2019-03-25 18:44           ` Amir Goldstein
  0 siblings, 0 replies; 16+ messages in thread
From: Amir Goldstein @ 2019-03-25 18:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs, linux-fsdevel

On Mon, Mar 25, 2019 at 8:02 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Mar 25, 2019 at 07:56:33PM +0200, Amir Goldstein wrote:
> > Sure, let's give that a shot. But allow me to stay skeptical, because
> > I don't think there is a one-size-fits-all solution.
> > If an application doesn't need >4K atomicity and xfs imposes file-wide
> > read locks, there is bound to be a workload where ext4 can guarantee
> > lower latencies than xfs.
> >
> > Then again, if we fix rw_semaphore to do a good enough job for my
> > workload, I may not care about those worst case workloads...
>
> Downgrading these long standing guarantees is simply not an option.

Right. Not without user opt-in.

>
> Not quite sure what the I/O pattern of your workload is, but if it
> is reads from other regions than you write to you should look into
> implementing range locks.

The workload is a VM image file (VDI) accessed with a random IO
pattern over SMB. I am still studying the exact rd/wr mix
patterns that really matter to customers.

Thanks,
Amir.

* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 18:22           ` Matthew Wilcox
@ 2019-03-25 19:18             ` Amir Goldstein
  2019-03-25 19:40               ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Amir Goldstein @ 2019-03-25 19:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Christoph Hellwig,
	linux-fsdevel

On Mon, Mar 25, 2019 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Mar 25, 2019 at 07:30:39PM +0200, Amir Goldstein wrote:
> > On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Mar 25, 2019 at 08:47:31AM -0700, Darrick J. Wong wrote:
> > > > Hmmm.... so it looks like the rw_semaphore behavior has shifted over
> > > > time, then?
> > >
> > > Yes.
> > >
> > > > I thought rwsem was supposed to queue read and write waiters in order,
> > > > at least on x86?  Though I suppose that might not matter much since we
> > > > can only run one writer at a time vs. waking up all the readers at once.
> > > > Now I'm wondering if there ever was a time when the readers all got
> > > > batched to the front and starved the writers, but eh I haven't drank
> > > > enough coffee to remember things like that. :P
> > > >
> > > > (I wonder what would happen if rw_semaphore decided to wake up some
> > > > number of the readers in the rwsem wait_list, not just the ones at the
> > > > front...)
> > >
> > > rwsems currently allow a limited amount of queue-jumping; if a semaphore
> > > is currently not acquired (it's in transition between two owners), a
> > > running process can acquire it.
> > >
> > > I think it is a bug that we only wake readers at the front of the queue;
> > > I think we would get better performance if we wake all readers.  ie here:
> > >
> > >         /*
> > > -        * Grant an infinite number of read locks to the readers at the front
> > > -        * of the queue. We know that woken will be at least 1 as we accounted
> > > +        * Grant an infinite number of read locks.  We know that woken will
> > > +        * be at least 1 as we accounted
> > >          * for above. Note we increment the 'active part' of the count by the
> > >          * number of readers before waking any processes up.
> > >          */
> > >         list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
> > >                 struct task_struct *tsk;
> > >
> > > -               if (waiter->type == RWSEM_WAITING_FOR_WRITE)
> > > -                       break;
> > >
> > > Amir, it seems like you have a good test-case for trying this out ...
> >
> > Sure, but please explain. Why are you waking up the writers?
>
> Oh ... duh.  I didn't mean to wake up the writers.  I meant ...
>
>                if (waiter->type == RWSEM_WAITING_FOR_WRITE)
> -                       break;
> +                       continue;
>

So I have no access to the test machine from the earlier tests right now,
but when running the same filebench randomrw workload
(8 writers, 8 readers) on a VM with 2 CPUs and an SSD drive, the results
are not looking good for this patch:

--- v5.1-rc1 / xfs ---
rand-write1          852404ops    14202ops/s 110.9mb/s      0.6ms/op
[0.01ms - 553.45ms]
rand-read1           26117ops      435ops/s   3.4mb/s     18.4ms/op
[0.04ms - 632.29ms]
61.088: IO Summary: 878521 ops 14636.774 ops/s 435/14202 rd/wr
114.3mb/s   1.1ms/op

--- v5.1-rc1 / xfs + patch above ---
rand-write1          1117998ops    18621ops/s 145.5mb/s      0.4ms/op
[0.01ms - 788.19ms]
rand-read1           7089ops      118ops/s   0.9mb/s     67.4ms/op
[0.03ms - 792.67ms]
61.091: IO Summary: 1125087 ops 18738.961 ops/s 118/18621 rd/wr
146.4mb/s   0.8ms/op

--- v5.1-rc1 / xfs + remove XFS_IOLOCK_SHARED from
xfs_file_buffered_aio_read ---
rand-write1          1025826ops    17091ops/s 133.5mb/s      0.5ms/op
[0.01ms - 909.20ms]
rand-read1           115162ops     1919ops/s  15.0mb/s      4.2ms/op
[0.00ms - 157.46ms]
61.084: IO Summary: 1140988 ops 19009.369 ops/s 1919/17091 rd/wr
148.5mb/s   0.8ms/op

--- v5.1-rc1 / ext4 ---
rand-write1          867926ops    14459ops/s 113.0mb/s      0.6ms/op
[0.01ms - 886.89ms]
rand-read1           121893ops     2031ops/s  15.9mb/s      3.9ms/op
[0.00ms - 149.24ms]
61.102: IO Summary: 989819 ops 16489.132 ops/s 2031/14459 rd/wr
128.8mb/s   1.0ms/op

So the rw_semaphore fix is not in the ballpark, not even looking in the
right direction...

Any other ideas to try?

Thanks,
Amir.

* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 19:18             ` Amir Goldstein
@ 2019-03-25 19:40               ` Matthew Wilcox
  2019-03-25 19:57                 ` Amir Goldstein
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2019-03-25 19:40 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Christoph Hellwig,
	linux-fsdevel

On Mon, Mar 25, 2019 at 09:18:51PM +0200, Amir Goldstein wrote:
> On Mon, Mar 25, 2019 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > On Mon, Mar 25, 2019 at 07:30:39PM +0200, Amir Goldstein wrote:
> > > On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > I think it is a bug that we only wake readers at the front of the queue;
> > > > I think we would get better performance if we wake all readers.  ie here:
> 
> So I have no access to the test machine from the earlier tests right now,
> but when running the same filebench randomrw workload
> (8 writers, 8 readers) on a VM with 2 CPUs and an SSD drive, the results
> are not looking good for this patch:
> 
> --- v5.1-rc1 / xfs ---
> rand-write1          852404ops    14202ops/s 110.9mb/s      0.6ms/op
> [0.01ms - 553.45ms]
> rand-read1           26117ops      435ops/s   3.4mb/s     18.4ms/op
> [0.04ms - 632.29ms]
> 61.088: IO Summary: 878521 ops 14636.774 ops/s 435/14202 rd/wr
> 114.3mb/s   1.1ms/op
> 
> --- v5.1-rc1 / xfs + patch above ---
> rand-write1          1117998ops    18621ops/s 145.5mb/s      0.4ms/op
> [0.01ms - 788.19ms]
> rand-read1           7089ops      118ops/s   0.9mb/s     67.4ms/op
> [0.03ms - 792.67ms]
> 61.091: IO Summary: 1125087 ops 18738.961 ops/s 118/18621 rd/wr
> 146.4mb/s   0.8ms/op
> 
> --- v5.1-rc1 / xfs + remove XFS_IOLOCK_SHARED from
> xfs_file_buffered_aio_read ---
> rand-write1          1025826ops    17091ops/s 133.5mb/s      0.5ms/op
> [0.01ms - 909.20ms]
> rand-read1           115162ops     1919ops/s  15.0mb/s      4.2ms/op
> [0.00ms - 157.46ms]
> 61.084: IO Summary: 1140988 ops 19009.369 ops/s 1919/17091 rd/wr
> 148.5mb/s   0.8ms/op
> 
> --- v5.1-rc1 / ext4 ---
> rand-write1          867926ops    14459ops/s 113.0mb/s      0.6ms/op
> [0.01ms - 886.89ms]
> rand-read1           121893ops     2031ops/s  15.9mb/s      3.9ms/op
> [0.00ms - 149.24ms]
> 61.102: IO Summary: 989819 ops 16489.132 ops/s 2031/14459 rd/wr
> 128.8mb/s   1.0ms/op
> 
> So the rw_semaphore fix is not in the ballpark, not even looking in the
> right direction...
> 
> Any other ideas to try?

Sure!  Maybe the problem is walking the list over and over.  So add new
readers to the front of the list if the head of the list is a reader;
otherwise add them to the tail of the list.

(this won't have quite the same effect as the previous patch because
new readers coming in while the head of the list is a writer will still
get jumbled with new writers, but it should be better than we have now,
assuming the problem is that readers are being delayed behind writers).

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index fbe96341beee..56dbbaea90ee 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -250,8 +250,15 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 			return sem;
 		}
 		adjustment += RWSEM_WAITING_BIAS;
+		list_add_tail(&waiter.list, &sem->wait_list);
+	} else {
+		struct rwsem_waiter *first = list_first_entry(&sem->wait_list,
+				typeof(*first), list);
+		if (first->type == RWSEM_WAITING_FOR_READ)
+			list_add(&waiter.list, &sem->wait_list);
+		else
+			list_add_tail(&waiter.list, &sem->wait_list);
 	}
-	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
 	count = atomic_long_add_return(adjustment, &sem->count);


* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 19:40               ` Matthew Wilcox
@ 2019-03-25 19:57                 ` Amir Goldstein
  2019-03-25 23:48                   ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Amir Goldstein @ 2019-03-25 19:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Christoph Hellwig,
	linux-fsdevel

On Mon, Mar 25, 2019 at 9:40 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Mar 25, 2019 at 09:18:51PM +0200, Amir Goldstein wrote:
> > On Mon, Mar 25, 2019 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > On Mon, Mar 25, 2019 at 07:30:39PM +0200, Amir Goldstein wrote:
> > > > On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > I think it is a bug that we only wake readers at the front of the queue;
> > > > > I think we would get better performance if we wake all readers.  ie here:
> >
> > So I have no access to the test machine of former tests right now,
> > but when running the same filebench randomrw workload
> > (8 writers, 8 readers) on VM with 2 CPUs and SSD drive, results
> > are not looking good for this patch:
> >
> > --- v5.1-rc1 / xfs ---
> > rand-write1          852404ops    14202ops/s 110.9mb/s      0.6ms/op
> > [0.01ms - 553.45ms]
> > rand-read1           26117ops      435ops/s   3.4mb/s     18.4ms/op
> > [0.04ms - 632.29ms]
> > 61.088: IO Summary: 878521 ops 14636.774 ops/s 435/14202 rd/wr
> > 114.3mb/s   1.1ms/op
> >

--- v5.1-rc1 / xfs + patch v2 below ---
rand-write1          852487ops    14175ops/s 110.7mb/s      0.6ms/op
[0.01ms - 755.24ms]
rand-read1           23194ops      386ops/s   3.0mb/s     20.7ms/op
[0.03ms - 755.25ms]
61.187: IO Summary: 875681 ops 14560.980 ops/s 386/14175 rd/wr
113.8mb/s   1.1ms/op

Not as bad as v1. Only a little bit worse than master...
The whole read/write balance deal changes a lot on SSD, I imagine.
That's why I was skeptical about a one-size-fits-all read/write balance.

Keeping an open mind.
Please throw more patches at me.
I will also test them on a machine with spindles tomorrow.

Thanks,
Amir.


> > --- v5.1-rc1 / xfs + patch above ---
> > rand-write1          1117998ops    18621ops/s 145.5mb/s      0.4ms/op
> > [0.01ms - 788.19ms]
> > rand-read1           7089ops      118ops/s   0.9mb/s     67.4ms/op
> > [0.03ms - 792.67ms]
> > 61.091: IO Summary: 1125087 ops 18738.961 ops/s 118/18621 rd/wr
> > 146.4mb/s   0.8ms/op
> >
> > --- v5.1-rc1 / xfs + remove XFS_IOLOCK_SHARED from
> > xfs_file_buffered_aio_read ---
> > rand-write1          1025826ops    17091ops/s 133.5mb/s      0.5ms/op
> > [0.01ms - 909.20ms]
> > rand-read1           115162ops     1919ops/s  15.0mb/s      4.2ms/op
> > [0.00ms - 157.46ms]
> > 61.084: IO Summary: 1140988 ops 19009.369 ops/s 1919/17091 rd/wr
> > 148.5mb/s   0.8ms/op
> >
> > --- v5.1-rc1 / ext4 ---
> > rand-write1          867926ops    14459ops/s 113.0mb/s      0.6ms/op
> > [0.01ms - 886.89ms]
> > rand-read1           121893ops     2031ops/s  15.9mb/s      3.9ms/op
> > [0.00ms - 149.24ms]
> > 61.102: IO Summary: 989819 ops 16489.132 ops/s 2031/14459 rd/wr
> > 128.8mb/s   1.0ms/op
> >
> > So rw_semaphore fix is not in the ballpark, not even looking in the
> > right direction...
> >
> > Any other ideas to try?
>
> Sure!  Maybe the problem is walking the list over and over.  So add new
> readers to the front of the list if the head of the list is a reader;
> otherwise add them to the tail of the list.
>
> (this won't have quite the same effect as the previous patch because
> new readers coming in while the head of the list is a writer will still
> get jumbled with new writers, but it should be better than we have now,
> assuming the problem is that readers are being delayed behind writers).
>
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index fbe96341beee..56dbbaea90ee 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -250,8 +250,15 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
>                         return sem;
>                 }
>                 adjustment += RWSEM_WAITING_BIAS;
> +               list_add_tail(&waiter.list, &sem->wait_list);
> +       } else {
> +               struct rwsem_waiter *first = list_first_entry(&sem->wait_list,
> +                               typeof(*first), list);
> +               if (first->type == RWSEM_WAITING_FOR_READ)
> +                       list_add(&waiter.list, &sem->wait_list);
> +               else
> +                       list_add_tail(&waiter.list, &sem->wait_list);
>         }
> -       list_add_tail(&waiter.list, &sem->wait_list);
>
>         /* we're now waiting on the lock, but no longer actively locking */
>         count = atomic_long_add_return(adjustment, &sem->count);


* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25  7:49   ` [QUESTION] Long read latencies on mixed rw buffered IO Amir Goldstein
  2019-03-25 15:47     ` Darrick J. Wong
@ 2019-03-25 23:43     ` Dave Chinner
  2019-03-26  4:36       ` Amir Goldstein
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2019-03-25 23:43 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-xfs, Christoph Hellwig, linux-fsdevel

On Mon, Mar 25, 2019 at 09:49:33AM +0200, Amir Goldstein wrote:
> On Mon, Mar 25, 2019 at 2:10 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sun, Mar 24, 2019 at 08:18:10PM +0200, Amir Goldstein wrote:
> > > Hi All,
> > >
> > > Christoph's re-factoring to xfs_ilock() brought up this question,
> > > but AFAICS, current behavior seems to have always been that
> > > way for xfs (?).
> > >
> > > Since commit 6552321831dc ("xfs: remove i_iolock and use
> > > i_rwsem in the VFS inode instead"), xfs_file_buffered_aio_read()
> > > is the only call sites I know of to call generic_file_read_iter() with
> > > i_rwsem read side held.
> >
> > That did not change the locking behaviour of XFS at all.
> 
> I know. Said so above.

Followed by a (?), which seems to indicate you were unsure and
seeking clarification.

> > I'm guessing what we are seeing here is a rwsem starvation problem.
> > The only reason a read would get delayed for more than a few hundred
> > microseconds is if there is a continual stream of write holders
> > starving pending reads.
> >
> 
> Yes. That seems to be the case.

And both your fio testing and the testing I've done locally seem to
indicate that this is the case - a constant stream of writers will
block readers for very long periods of time.

> > I don't recall it being this bad historically but rwsems have been
> > severely performance optimised for various corner cases over the
> > past few years and it is extremely likely that behaviour has
> > changed.  Definitely needs looking at, but rwsems are so fragile
> > these days (highly dependent on memory ordering that nobody seems to
> > be able to get right) and impossible to validate as correct that we
> > might just end up having to live with it.
> >
> > FWIW, What kernel did you test this on?
> 
> The results above are from kernel 4.20.7-1.el7.elrepo.x86_64.
> Your assumption about change of behavior of rwsem over the
> years seems to be correct.
> The original report we got from customer was with kernel 4.9.134
> In the original report, long latencies were on writes rather than on reads.
> The following output of filebench randomrw workload summarizes the results:

To summarise: "behaviour used to be constant stream of readers block
writers".

Yay us!

> rand-write1          1128ops       19ops/s   0.1mb/s    340.3ms/op
> [0.03ms - 7933.27ms]
> rand-read1           2954684ops    49238ops/s 384.7mb/s      0.2ms/op
> [0.02ms - 5873.27ms]
> 61.026: IO Summary: 2955812 ops 49257.253 ops/s 49238/19 rd/wr
> 384.8mb/s   0.3ms/op
> 
> With more recent kernel (4.20), the latencies have shifted to the read ops:
> 
> rand-write1          467706ops     7794ops/s  60.9mb/s    0.733ms/op
> [0.004ms - 3899.167ms]
> rand-read1           1970ops       33ops/s   0.3mb/s  164.277ms/op
> [0.017ms - 3905.929ms]
> 61.032: IO Summary: 469676 ops 7826.760 ops/s 33/7794 rd/wr  61.1mb/s 1.419ms/op
> 
> In a workload of 100% reads or 100% writes, there are no extreme
> latencies on either the old or the new kernel.

Yup, because there's no reader/writer contention on the lock. The
change of behaviour here is due to the rwsem implementation lacking
proper barrier semantics, and so the implementation changes
behaviour according to what microbenchmark goes faster this week.

We never had these problems with Irix rwsems, because:

commit a7724543560e9b6434710abcd5528234f108e204
Author: Doug Doucette <doucette@engr.sgi.com>
Date:   Sat Jun 14 01:44:30 1997 +0000

    Make the iolock be a BARRIER mrlock instead of the default.
    This makes it force new readers to wait if there's a writer
    waiting, insuring some fairness.

Almost all XFS rwsems were configured as MRLOCK_BARRIER rwsems,
which had mostly fair queuing behaviour.

	- if held by read, other readers can grab until pending
	  writer.
	- if held by read and pending writer, reads queue behind
	  writer.
	- if held by read, no pending reads and pending writer, writers get batched
	  with pending writer
	- if held by writer and no pending reader, new writer is
	  queued to run after writer and before any new pending
	  readers.
	- if held by writer and pending reader, new writers get queued
	  behind reads
	- if held by writer and pending reader, readers get batched
	  with pending reader

basically, these barrier semantics prevented new readers from
starving waiting writers, and prevented new writers from starving
pending readers.
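
(For illustration only, here's roughly what the reader-side half of that
barrier policy looks like as a user-space pthreads sketch: new readers
queue behind a pending writer, and a queued writer waits for the active
readers to drain. This is not the Irix mrlock code; the names are made
up and writer batching is left out.)

#include <pthread.h>

struct barrier_rwlock {
	pthread_mutex_t	mutex;
	pthread_cond_t	cond;
	int		readers_active;		/* readers holding the lock */
	int		writer_active;		/* 1 if a writer holds the lock */
	int		writers_waiting;	/* writers queued on the lock */
};

#define BARRIER_RWLOCK_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, 0 }

static void brw_read_lock(struct barrier_rwlock *rw)
{
	pthread_mutex_lock(&rw->mutex);
	/* Barrier: new readers do not jump ahead of a waiting writer. */
	while (rw->writer_active || rw->writers_waiting)
		pthread_cond_wait(&rw->cond, &rw->mutex);
	rw->readers_active++;
	pthread_mutex_unlock(&rw->mutex);
}

static void brw_read_unlock(struct barrier_rwlock *rw)
{
	pthread_mutex_lock(&rw->mutex);
	if (--rw->readers_active == 0)
		pthread_cond_broadcast(&rw->cond);
	pthread_mutex_unlock(&rw->mutex);
}

static void brw_write_lock(struct barrier_rwlock *rw)
{
	pthread_mutex_lock(&rw->mutex);
	rw->writers_waiting++;
	/* Wait for the current holder(s), reader or writer, to drain. */
	while (rw->writer_active || rw->readers_active)
		pthread_cond_wait(&rw->cond, &rw->mutex);
	rw->writers_waiting--;
	rw->writer_active = 1;
	pthread_mutex_unlock(&rw->mutex);
}

static void brw_write_unlock(struct barrier_rwlock *rw)
{
	pthread_mutex_lock(&rw->mutex);
	rw->writer_active = 0;
	pthread_cond_broadcast(&rw->cond);
	pthread_mutex_unlock(&rw->mutex);
}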

/me wonders if he should just implement this as a new type of lock
so that we don't end up having read/write balance change whenever
somebody modifies rwsem behaviour to make some microbenchmark go
faster....



> > I'd be interested in seeing if a fio workload using 16 randrw
> > threads demonstrate the same latency profile (i.e. threads aren't
> > exclusively read- or write- only), or is profile coming about
> > specifically because you have dedicated write-only threads that will
> > continually bash on the IO locks and hence starve readers?
> 
> I guess it depends on the rw mix. With 50% r/w mix run on kernel
> 4.20.7-1.el7, long latencies are mostly on the write ops.

Worst case latencies are roughly equal at ~500ms, distribution is
almost identical - ~50% of writes are not blocking at all, the rest
are waiting behind reads and so showing >=50th percentile latencies
largely similar to the read side.

fio displays this behaviour because it bounds the run length of the
pending write or read IO stream as each thread selects read or write
randomly. Hence each thread will soon issue an IO that will get
blocked and so the exclusion stream is essentially bound in depth by
the random number generator. 

> > > My question is, is the purpose of this lock syncing
> > > dio/buffered io?
> >
> > That's one part of it. The other is POSIX atomic write semantics.
> >
> > https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html
> >
> > "I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > Atomic means that all the bytes from a single operation that started
> > out together end up together, without interleaving from other I/O
> > operations."
> 
> I see... Are you aware of any applications that depend on this POSIX
> atomic write behavior? It would seem very unlikely that such applications
> exist, if only xfs provides this guarantee on Linux...

Databases like this behaviour very much....

> IMO, it would be best if applications that desire atomic writes semantics
> would opt for it using a preadv2 flag. Or, for backward compat, we can allow
> applications to opt out of atomic write semantics. [CC fsdevel]

You seem to have already decided that it is going to be optional
behaviour in XFS. Perhaps you should wait for people to decide
whether we want to do that or not....

> All right. I am reading the above as no prior objection to adding
> an XFS mount option and/or preadv2() flag to opt out of this
> behavior and align with the reckless behavior of the other local
> filesystems on Linux.
> Please correct me if I am wrong and I would like to hear what
> other people think w.r.t. mount option vs. preadv2() flag.
> It could also be an open flag, but I prefer not to go down that road...

"No prior objection" doesn't mean "yes, we are going to do this".

Indeed, if you are doing high performance, concurrent mixed
read/write IO to the same file, then we've already got a solution
for this problem. i.e. using AIO+DIO on XFS will allow
both reads and writes to be issued concurrently and not block the
main thread of execution. That solves all the problems in one go...
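
(To make that concrete: a minimal sketch of the AIO+DIO approach with
libaio might look like the below. The path, the 4k alignment and the
lack of error handling are just simplifications for the example; build
with -laio.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/xfs/testfile", O_RDWR | O_DIRECT);
	io_context_t ctx = 0;
	struct iocb wr, rd, *iocbs[2];
	struct io_event events[2];
	void *wbuf, *rbuf;

	io_setup(64, &ctx);
	/* O_DIRECT wants aligned buffers, offsets and lengths. */
	posix_memalign(&wbuf, 4096, 4096);
	posix_memalign(&rbuf, 4096, 4096);
	memset(wbuf, 0xaa, 4096);

	/* A write and a read submitted concurrently from one thread. */
	io_prep_pwrite(&wr, fd, wbuf, 4096, 0);
	io_prep_pread(&rd, fd, rbuf, 4096, 1048576);
	iocbs[0] = &wr;
	iocbs[1] = &rd;
	io_submit(ctx, 2, iocbs);
	io_getevents(ctx, 2, 2, events, NULL);

	io_destroy(ctx);
	close(fd);
	return 0;
}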

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 19:57                 ` Amir Goldstein
@ 2019-03-25 23:48                   ` Dave Chinner
  2019-03-26  3:44                     ` Amir Goldstein
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2019-03-25 23:48 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Matthew Wilcox, Darrick J. Wong, linux-xfs, Christoph Hellwig,
	linux-fsdevel

On Mon, Mar 25, 2019 at 09:57:46PM +0200, Amir Goldstein wrote:
> On Mon, Mar 25, 2019 at 9:40 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Mar 25, 2019 at 09:18:51PM +0200, Amir Goldstein wrote:
> > > On Mon, Mar 25, 2019 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > On Mon, Mar 25, 2019 at 07:30:39PM +0200, Amir Goldstein wrote:
> > > > > On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > I think it is a bug that we only wake readers at the front of the queue;
> > > > > > I think we would get better performance if we wake all readers.  ie here:
> > >
> > > So I have no access to the test machine of former tests right now,
> > > but when running the same filebench randomrw workload
> > > (8 writers, 8 readers) on VM with 2 CPUs and SSD drive, results
> > > are not looking good for this patch:
> > >
> > > --- v5.1-rc1 / xfs ---
> > > rand-write1          852404ops    14202ops/s 110.9mb/s      0.6ms/op
> > > [0.01ms - 553.45ms]
> > > rand-read1           26117ops      435ops/s   3.4mb/s     18.4ms/op
> > > [0.04ms - 632.29ms]
> > > 61.088: IO Summary: 878521 ops 14636.774 ops/s 435/14202 rd/wr
> > > 114.3mb/s   1.1ms/op
> > >
> 
> --- v5.1-rc1 / xfs + patch v2 below ---
> rand-write1          852487ops    14175ops/s 110.7mb/s      0.6ms/op
> [0.01ms - 755.24ms]
> rand-read1           23194ops      386ops/s   3.0mb/s     20.7ms/op
> [0.03ms - 755.25ms]
> 61.187: IO Summary: 875681 ops 14560.980 ops/s 386/14175 rd/wr
> 113.8mb/s   1.1ms/op
> 
> Not as bad as v1. Only a little bit worse than master...
> The whole read/write balance deal changes a lot on SSD, I imagine.
> That's why I was skeptical about a one-size-fits-all read/write balance.

You're not testing your SSD. You're testing writes into cache vs
reads from disk. There is a massive latency difference in the two
operations, so unless you use O_DSYNC for the writes, you are going
to see this cache-vs-uncached performance imbalance. i.e. unless the
rwsem is truly fair, there is always going to be more writer
access to the lock because they spend less time holding it and so
can put much more pressure on it.
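
(Illustrative snippet, nothing more: making the writers pay the device
latency is just a matter of how the write fd is opened.)

#include <fcntl.h>

/*
 * With O_DSYNC each write returns only after the data (and any
 * metadata needed to retrieve it) has reached stable storage,
 * instead of completing into the page cache.
 */
static int open_sync_writer(const char *path)
{
	return open(path, O_WRONLY | O_DSYNC);
}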

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 23:48                   ` Dave Chinner
@ 2019-03-26  3:44                     ` Amir Goldstein
  2019-03-27  1:29                       ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Amir Goldstein @ 2019-03-26  3:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Darrick J. Wong, linux-xfs, Christoph Hellwig,
	linux-fsdevel

On Tue, Mar 26, 2019 at 1:48 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Mar 25, 2019 at 09:57:46PM +0200, Amir Goldstein wrote:
> > On Mon, Mar 25, 2019 at 9:40 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Mar 25, 2019 at 09:18:51PM +0200, Amir Goldstein wrote:
> > > > On Mon, Mar 25, 2019 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > On Mon, Mar 25, 2019 at 07:30:39PM +0200, Amir Goldstein wrote:
> > > > > > On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > I think it is a bug that we only wake readers at the front of the queue;
> > > > > > > I think we would get better performance if we wake all readers.  ie here:
> > > >
> > > > So I have no access to the test machine of former tests right now,
> > > > but when running the same filebench randomrw workload
> > > > (8 writers, 8 readers) on VM with 2 CPUs and SSD drive, results
> > > > are not looking good for this patch:
> > > >
> > > > --- v5.1-rc1 / xfs ---
> > > > rand-write1          852404ops    14202ops/s 110.9mb/s      0.6ms/op
> > > > [0.01ms - 553.45ms]
> > > > rand-read1           26117ops      435ops/s   3.4mb/s     18.4ms/op
> > > > [0.04ms - 632.29ms]
> > > > 61.088: IO Summary: 878521 ops 14636.774 ops/s 435/14202 rd/wr
> > > > 114.3mb/s   1.1ms/op
> > > >
> >
> > --- v5.1-rc1 / xfs + patch v2 below ---
> > rand-write1          852487ops    14175ops/s 110.7mb/s      0.6ms/op
> > [0.01ms - 755.24ms]
> > rand-read1           23194ops      386ops/s   3.0mb/s     20.7ms/op
> > [0.03ms - 755.25ms]
> > 61.187: IO Summary: 875681 ops 14560.980 ops/s 386/14175 rd/wr
> > 113.8mb/s   1.1ms/op
> >
> > Not as bad as v1. Only a little bit worse than master...
> > The whole read/write balance deal changes a lot on SSD, I imagine.
> > That's why I was skeptical about a one-size-fits-all read/write balance.
>
> You're not testing your SSD. You're testing writes into cache vs
> reads from disk. There is a massive latency difference in the two
> operations, so unless you use O_DSYNC for the writes, you are going
> to see this cache-vs-uncached performance unbalance. i.e. unless the
> rwsem is truly fair, there is always going to be more writer
> access to the lock because they spend less time holding it and so
> can put much more pressure on it.
>

Yeah, I know. SSD makes the balance better because of faster reads
from disk. I was pointing out that the worst case I am interested in is
on spindles. That said, O_DSYNC certainly does improve the balance
and gives shorter worst case latencies. However, it does not make the
problem go away. Taking i_rwsem (even for 4K reads) takes its toll
on write latencies (compared to ext4).

Thanks,
Amir.


* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 23:43     ` Dave Chinner
@ 2019-03-26  4:36       ` Amir Goldstein
  0 siblings, 0 replies; 16+ messages in thread
From: Amir Goldstein @ 2019-03-26  4:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Christoph Hellwig, linux-fsdevel

On Tue, Mar 26, 2019 at 1:43 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Mar 25, 2019 at 09:49:33AM +0200, Amir Goldstein wrote:
> > On Mon, Mar 25, 2019 at 2:10 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Sun, Mar 24, 2019 at 08:18:10PM +0200, Amir Goldstein wrote:
> > > > Hi All,
> > > >
> > > > Christoph's re-factoring to xfs_ilock() brought up this question,
> > > > but AFAICS, current behavior seems to have always been that
> > > > way for xfs (?).
> > > >
> > > > Since commit 6552321831dc ("xfs: remove i_iolock and use
> > > > i_rwsem in the VFS inode instead"), xfs_file_buffered_aio_read()
> > > > is the only call sites I know of to call generic_file_read_iter() with
> > > > i_rwsem read side held.
> > >
> > > That did not change the locking behaviour of XFS at all.
> >
> > I know. Said so above.
>
> Followed by a (?), which seems to indicate you were unsure and
> seeking clarification.
>

Ah yes, (?) was referring to "always been", which you already affirmed.

> > > I'm guessing what we are seeing here is a rwsem starvation problem.
> > > The only reason a read would get delayed for more than a few hundred
> > > microseconds is if there is a continual stream of write holders
> > > starving pending reads.
> > >
> >
> > Yes. That seems to be the case.
>
> And both your fio testing and the testing I've done locally seem to
> indicate that this is the case - a constant stream of writers will
> block readers for very long periods of time.
>
> > > I don't recall it being this bad historically but rwsems have been
> > > severely performance optimised for various corner cases over the
> > > past few years and it is extremely likely that behaviour has
> > > changed.  Definitely needs looking at, but rwsems are so fragile
> > > these days (highly dependent on memory ordering that nobody seems to
> > > be able to get right) and impossible to validate as correct that we
> > > might just end up having to live with it.
> > >
> > > FWIW, What kernel did you test this on?
> >
> > The results above are from kernel 4.20.7-1.el7.elrepo.x86_64.
> > Your assumption about change of behavior of rwsem over the
> > years seems to be correct.
> > The original report we got from customer was with kernel 4.9.134
> > In the original report, long latencies were on writes rather than on reads.
> > The following output of filebench randomrw workload summarizes the results:
>
> To summarise: "behaviour used to be constant stream of readers block
> writers".
>
> Yay us!
>
> > rand-write1          1128ops       19ops/s   0.1mb/s    340.3ms/op
> > [0.03ms - 7933.27ms]
> > rand-read1           2954684ops    49238ops/s 384.7mb/s      0.2ms/op
> > [0.02ms - 5873.27ms]
> > 61.026: IO Summary: 2955812 ops 49257.253 ops/s 49238/19 rd/wr
> > 384.8mb/s   0.3ms/op
> >
> > With more recent kernel (4.20), the latencies have shifted to the read ops:
> >
> > rand-write1          467706ops     7794ops/s  60.9mb/s    0.733ms/op
> > [0.004ms - 3899.167ms]
> > rand-read1           1970ops       33ops/s   0.3mb/s  164.277ms/op
> > [0.017ms - 3905.929ms]
> > 61.032: IO Summary: 469676 ops 7826.760 ops/s 33/7794 rd/wr  61.1mb/s 1.419ms/op
> >
> > In a workload of 100% reads or 100% writes, there are no extreme
> > latencies on either the old or the new kernel.
>
> Yup, because there's no reader/writer contention on the lock. The
> change of behaviour here is due to the rwsem implementation lacking
> proper barrier semantics, and so the implementation changes
> behaviour according to what microbenchmark goes faster this week.
>
> We never had these problems with Irix rwsems, because:
>
> commit a7724543560e9b6434710abcd5528234f108e204
> Author: Doug Doucette <doucette@engr.sgi.com>
> Date:   Sat Jun 14 01:44:30 1997 +0000
>
>     Make the iolock be a BARRIER mrlock instead of the default.
>     This makes it force new readers to wait if there's a writer
>     waiting, insuring some fairness.
>
> Almost all XFS rwsems were configured as MRLOCK_BARRIER rwsems,
> which had mostly fair queuing behaviour.
>
>         - if held by read, other readers can grab until pending
>           writer.
>         - if held by read and pending writer, reads queue behind
>           writer.
>         - if held by read, no pending reads and pending writer, writers get batched
>           with pending writer
>         - if held by writer and no pending reader, new writer is
>           queued to run after writer and before any new pending
>           readers.
>         - if held by writer and pending reader, new writers get queued
>           behind reads
>         - if held by writer and pending reader, readers get batched
>           with pending reader
>
> basically, these barrier semantics prevented new readers from
> starving waiting writers, and prevented new writers from starving
> pending readers.
>
> /me wonders if he should just implement this as a new type of lock
> so that we don't end up having read/write balance change whenever
> somebody modifies rwsem behaviour to make some microbenchmark go
> faster....
>

That sounds interesting. I will be happy to test patches.
Improving the microbenchmark is not my goal, though.
The customer reported write latencies with kernel 4.9.
I was able to reproduce the write latencies using a microbenchmark
with kernel 4.9, but the same microbenchmark demonstrated different
behavior on kernel 4.20 (read latencies).
So it looks like I am really after the case of readers blocking
writers, regardless of starvation, which has also been
demonstrated with fio in the workload flavor that you suggested
(mixed r/w on the same thread).

>
>
> > > I'd be interested in seeing if a fio workload using 16 randrw
> > > threads demonstrate the same latency profile (i.e. threads aren't
> > > exclusively read- or write- only), or is profile coming about
> > > specifically because you have dedicated write-only threads that will
> > > continually bash on the IO locks and hence starve readers?
> >
> > I guess it depends on the rw mix. With 50% r/w mix run on kernel
> > 4.20.7-1.el7, long latencies are mostly on the write ops.
>
> Worst case latencies are roughly equal at ~500ms, distribution is
> almost identical - ~50% of writes are not blocking at all, the rest
> are waiting behind reads and so showing >=50th percentile latencies
> largely similar to the read side.
>
> fio displays this behaviour because it bounds the run length of the
> pending write or read IO stream as each thread selects read or write
> randomly. Hence each thread will soon issue an IO that will get
> blocked and so the exclusion stream is essentially bound in depth by
> the random number generator.
>
> > > > My question is, is the purpose of this lock syncing
> > > > dio/buffered io?
> > >
> > > That's one part of it. The other is POSIX atomic write semantics.
> > >
> > > https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html
> > >
> > > "I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > > Atomic means that all the bytes from a single operation that started
> > > out together end up together, without interleaving from other I/O
> > > operations."
> >
> > I see... Are you aware of any applications that depend on this POSIX
> > atomic write behavior? It would seem very unlikely that such applications
> > exist, if only xfs provides this guarantee on Linux...
>
> Databases like this behaviour very much....
>

I was under the impression that databases use DIO, where xfs explicitly
leaves concurrent IO synchronization to user space. Is that not the case?

> > IMO, it would be best if applications that desire atomic writes semantics
> > would opt for it using a preadv2 flag. Or, for backward compat, we can allow
> > applications to opt out of atomic write semantics. [CC fsdevel]
>
> You seem to have already decided that it is going to be optional
> behaviour in XFS. Perhaps you should wait for people to decide
> whether we want to do that or not....
>
> > All right. I am reading the above as no prior objection to adding
> > an XFS mount option and/or preadv2() flag to opt out of this
> > behavior and align with the reckless behavior of the other local
> > filesystems on Linux.
> > Please correct me if I am wrong and I would like to hear what
> > other people think w.r.t. mount option vs. preadv2() flag.
> > It could also be an open flag, but I prefer not to go down that road...
>
> "No prior objection" doesn't mean "yes, we are going to do this".
>

Of course. The burden of proof is still on me, to show a real-life
workload and to explain why the problem cannot be solved using APIs
that xfs already provides. I am working via several layers of mediators
to get to the real-life workload, so the investigation may take some
time. At this point, I have gathered enough input to be able to ask this
[Question]. If this discussion ends up also improving fairness on
rw_semaphore, all the better.

> Indeed, if you are doing high performance, concurrent mixed
> read/write IO to the same file, then we've already got a solution
> for this problem. i.e. using AIO+DIO on XFS will allow
> both reads and writes to be issued concurrently and not block the
> main thread of execution. That solves all the problems in one go...
>

Yes, I know. I have looked into using AIO+DIO in samba.
The problem is that samba is a general purpose remote IO dispatcher,
so unlike a regular application, it cannot know a priori that IO to a
given file is going to be aligned (even though in this benchmark it is).

I was under the impression that mixing concurrent buffered IO and DIO
on the same file is not recommended, but from this thread I understand
that it may be an option for xfs (?). In any case, I would rather not resort
to special-casing xfs in samba, although it may come to this...

I considered doing "opportunistic DIO", that is, starting with O_DIRECT
(always or given certain SMB open hints) and switching to buffered IO
on the first unaligned IO (after all previous DIOs have completed).
That may work and should be fine with non-xfs as well.
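
Roughly what I have in mind is something like the untested sketch below;
the helper names are made up and the draining of in-flight DIO on the
fallback path is not shown:

#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define DIO_ALIGN 4096		/* assumed device/fs alignment */

static bool io_is_dio_aligned(const void *buf, size_t len, off_t off)
{
	return ((uintptr_t)buf % DIO_ALIGN) == 0 &&
	       (len % DIO_ALIGN) == 0 &&
	       (off % DIO_ALIGN) == 0;
}

/* dio_fd was opened with O_DIRECT, buf_fd without. */
static ssize_t smb_pwrite(int dio_fd, int buf_fd, bool *use_dio,
			  const void *buf, size_t len, off_t off)
{
	if (*use_dio && !io_is_dio_aligned(buf, len, off)) {
		/*
		 * First unaligned IO: wait for pending DIO to complete
		 * (not shown), then stick with buffered IO from now on.
		 */
		*use_dio = false;
	}
	return pwrite(*use_dio ? dio_fd : buf_fd, buf, len, off);
}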

Thanks,
Amir.


* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-26  3:44                     ` Amir Goldstein
@ 2019-03-27  1:29                       ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2019-03-27  1:29 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Matthew Wilcox, Darrick J. Wong, linux-xfs, Christoph Hellwig,
	linux-fsdevel

On Tue, Mar 26, 2019 at 05:44:34AM +0200, Amir Goldstein wrote:
> On Tue, Mar 26, 2019 at 1:48 AM Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Mar 25, 2019 at 09:57:46PM +0200, Amir Goldstein wrote:
> > > Not as bad as v1. Only a little bit worse than master...
> > > The whole read/write balance deal changes a lot on SSD, I imagine.
> > > That's why I was skeptical about a one-size-fits-all read/write balance.
> >
> > You're not testing your SSD. You're testing writes into cache vs
> > reads from disk. There is a massive latency difference in the two
> > operations, so unless you use O_DSYNC for the writes, you are going
> to see this cache-vs-uncached performance imbalance. i.e. unless the
> > rwsem is truly fair, there is always going to be more writer
> > access to the lock because they spend less time holding it and so
> > can put much more pressure on it.
> >
> 
> Yeah, I know. SSD makes the balance better because of faster reads
> from disk. I was pointing out that the worst case I am interested in is
> on spindles. That said, O_DSYNC certainly does improve the balance
> and gives shorter worst case latencies. However, it does not make the
> problem go away. Taking i_rwsem (even for 4K reads) takes its toll
> on write latencies (compared to ext4).

Sure, that's because it's serialising access across the entire file,
not just the page that is being read/written to.

The solution here is to move to concurrent buffered writes
similar to how we do direct IO. We need to track IO ranges for that,
though, so we can serialise unaligned and/or overlapping IOs. We
need fast, efficient range locks for this.
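
(Toy user-space sketch of the idea only; a real kernel implementation
would want an interval tree and much smarter wakeups, and error
handling is elided here.)

#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>

struct io_range {
	off_t start, end;		/* locked range is [start, end) */
	struct io_range *next;
};

static struct io_range *held;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* Called with 'lock' held. */
static bool range_busy(off_t start, off_t end)
{
	struct io_range *r;

	for (r = held; r; r = r->next)
		if (start < r->end && r->start < end)
			return true;
	return false;
}

/* Block until [start, end) does not overlap any in-flight IO. */
static struct io_range *range_lock(off_t start, off_t end)
{
	struct io_range *r = malloc(sizeof(*r));

	r->start = start;
	r->end = end;
	pthread_mutex_lock(&lock);
	while (range_busy(start, end))
		pthread_cond_wait(&cond, &lock);
	r->next = held;
	held = r;
	pthread_mutex_unlock(&lock);
	return r;
}

static void range_unlock(struct io_range *r)
{
	struct io_range **p = &held;

	pthread_mutex_lock(&lock);
	while (*p != r)
		p = &(*p)->next;
	*p = r->next;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	free(r);
}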

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 16+ messages
     [not found] <CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com>
     [not found] ` <20190325001044.GA23020@dastard>
2019-03-25  7:49   ` [QUESTION] Long read latencies on mixed rw buffered IO Amir Goldstein
2019-03-25 15:47     ` Darrick J. Wong
2019-03-25 16:41       ` Matthew Wilcox
2019-03-25 17:30         ` Amir Goldstein
2019-03-25 18:22           ` Matthew Wilcox
2019-03-25 19:18             ` Amir Goldstein
2019-03-25 19:40               ` Matthew Wilcox
2019-03-25 19:57                 ` Amir Goldstein
2019-03-25 23:48                   ` Dave Chinner
2019-03-26  3:44                     ` Amir Goldstein
2019-03-27  1:29                       ` Dave Chinner
2019-03-25 17:56       ` Amir Goldstein
2019-03-25 18:02         ` Christoph Hellwig
2019-03-25 18:44           ` Amir Goldstein
2019-03-25 23:43     ` Dave Chinner
2019-03-26  4:36       ` Amir Goldstein
