* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  [not found] ` <20190325001044.GA23020@dastard>
@ 2019-03-25  7:49 ` Amir Goldstein
  2019-03-25 15:47   ` Darrick J. Wong
  2019-03-25 23:43   ` Dave Chinner
  0 siblings, 2 replies; 16+ messages in thread
From: Amir Goldstein @ 2019-03-25  7:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Christoph Hellwig, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 10799 bytes --]

On Mon, Mar 25, 2019 at 2:10 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Mar 24, 2019 at 08:18:10PM +0200, Amir Goldstein wrote:
> > Hi All,
> >
> > Christoph's re-factoring of xfs_ilock() brought up this question,
> > but AFAICS, current behavior seems to have always been that
> > way for xfs (?).
> >
> > Since commit 6552321831dc ("xfs: remove i_iolock and use
> > i_rwsem in the VFS inode instead"), xfs_file_buffered_aio_read()
> > is the only call site I know of that calls generic_file_read_iter() with
> > i_rwsem read side held.
>
> That did not change the locking behaviour of XFS at all.

I know. Said so above.

> > This lock is killing the performance of a multi-threaded buffered
> > read/write mixed workload on the same file [1].
> > Attached is the output of the bcc tools [2] scripts xfsdist and ext4dist
> > for latency distribution on the same mixed read/write workload.
>
> In future, can you just paste the text inline, rather than attach
> it as base64 encoded attachments? They are kinda hard to quote.
>
> ext4 read:
>
> ....
>     4 -> 7       : 2610
> ....
>  2048 -> 4095    : 232
>  4096 -> 8191    : 632
>  8192 -> 16383   : 925
> 16384 -> 32767   : 408
> ....
>
> ext4 write:
>
>     4 -> 7       : 16732
>     8 -> 15      : 81712
>    16 -> 31      : 121450
>    32 -> 63      : 29638
>
> So about 4500 read IOPS, of which >50% are cache hits.
> And ~250,000 write iops, all in less than 64 microseconds so
> straight into cache. So we are looking at a ~50:1 write to read
> ratio here.
>
> XFS read:
>
>     0 -> 1       : 8
>     2 -> 3       : 21
>     4 -> 7       : 17
> ....
>    8192 -> 16383   : 4
>   16384 -> 32767   : 8
>   32768 -> 65535   : 17
>   65536 -> 131071  : 55
>  131072 -> 262143  : 41
>  262144 -> 524287  : 104
>  524288 -> 1048575 : 101
> 1048576 -> 2097151 : 34
> ....
>
> XFS Write:
>     4 -> 7       : 10087
>     8 -> 15      : 34815
>    16 -> 31      : 32446
>    32 -> 63      : 3888
>    64 -> 127     : 155
> ....
> <snip long tail>
>
> Which shows ~400 read ops (almost no cache hits)

It's not because of almost no cache hits.
The pattern is the same as in the ext4 test.
It's because most reads are blocked on writes in the slow path,
as the following sample stack trace from the offwaketime bcc tool shows:

Tracing blocked time (us) by user + kernel off-CPU and waker stack...

    waker:           filebench 32155
    entry_SYSCALL_64_after_hwframe
    do_syscall_64
    sys_pwrite64
    vfs_write
    __vfs_write
    new_sync_write
    xfs_file_write_iter
    xfs_file_buffered_aio_write
    xfs_iunlock
    up_write
    call_rwsem_wake
    rwsem_wake
    --               --
    finish_task_switch
    schedule
    rwsem_down_read_failed
    call_rwsem_down_read_failed
    down_read
    xfs_ilock
    xfs_file_buffered_aio_read
    xfs_file_read_iter
    new_sync_read
    __vfs_read
    vfs_read
    sys_pread64
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    pread
    target:          filebench 32147
        109242

> and ~80,000 write
> ops with a long latency tail. It's roughly a 200:1 write to read
> ratio.
>
> I'm guessing what we are seeing here is a rwsem starvation problem.
> The only reason a read would get delayed for more than a few hundred
> microseconds is if there is a continual stream of write holders
> starving pending reads.

Yes. That seems to be the case.

> I don't recall it being this bad historically but rwsems have been
> severely performance optimised for various corner cases over the
> past few years and it is extremely likely that behaviour has
> changed. Definitely needs looking at, but rwsems are so fragile
> these days (highly dependent on memory ordering that nobody seems to
> be able to get right) and impossible to validate as correct that we
> might just end up having to live with it.
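[Editorial note] The starvation policy Dave describes above can be modelled in userspace with a toy writer-preferring reader/writer lock. This is an illustrative sketch only, not the kernel's rwsem implementation; the class and attribute names are invented for the example. The key line is the reader waiting on *queued* writers, not just the current holder, which is what lets a continual stream of writers push pending readers' latency out indefinitely:

```python
import threading

class WriterPreferringRWLock:
    """Toy reader/writer lock in which queued writers block new readers.

    A continual stream of writers therefore starves readers, loosely
    mimicking the i_rwsem behaviour reported in this thread. Purely a
    model for illustration -- not the kernel rwsem algorithm.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0           # readers currently holding the lock
        self._writer = False        # is a writer currently holding it?
        self._waiting_writers = 0   # writers queued for the lock

    def acquire_read(self):
        with self._cond:
            # Readers yield not only to the current writer but to ANY
            # queued writer -- this is what produces reader starvation.
            while self._writer or self._waiting_writers:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._waiting_writers += 1
            while self._writer or self._readers:
                self._cond.wait()
            self._waiting_writers -= 1
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

With 8 threads looping on acquire_write()/release_write(), an acquire_read() caller can wait arbitrarily long, since a new writer is almost always queued by the time the previous one releases; a reader-fair lock would bound that wait.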
>
> FWIW, What kernel did you test this on?

The results above are from kernel 4.20.7-1.el7.elrepo.x86_64.
Your assumption about the change in rwsem behavior over the
years seems to be correct.
The original report we got from a customer was with kernel 4.9.134.
In the original report, long latencies were on writes rather than on reads.
The following output of the filebench randomrw workload summarizes the results:

rand-write1   1128ops      19ops/s   0.1mb/s 340.3ms/op [0.03ms - 7933.27ms]
rand-read1 2954684ops   49238ops/s 384.7mb/s   0.2ms/op [0.02ms - 5873.27ms]
61.026: IO Summary: 2955812 ops 49257.253 ops/s 49238/19 rd/wr 384.8mb/s 0.3ms/op

With a more recent kernel (4.20), the latencies have shifted to the read ops:

rand-write1 467706ops    7794ops/s  60.9mb/s   0.733ms/op [0.004ms - 3899.167ms]
rand-read1    1970ops      33ops/s   0.3mb/s 164.277ms/op [0.017ms - 3905.929ms]
61.032: IO Summary: 469676 ops 7826.760 ops/s 33/7794 rd/wr 61.1mb/s 1.419ms/op

In a workload of 100% reads or 100% writes, there are no extreme
latencies on either the old or the new kernel.

>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>

Granted, I wasn't reporting a problem in the traditional sense of "what is
the cause?" - that I know. I was asking why this difference between xfs
and other filesystems.
I gathered a lot of traces and analysis from different tools and on
different kernels during the investigation, so pouring them out all at
once would have been counterproductive. I'll happily provide any
required analysis on demand.

> > Compared to ext4, avg. read latency on a RAID of spindles
> > can be two orders of magnitude higher (>100ms).
> > I can provide more performance numbers if needed with fio,
> > but they won't surprise anyone considering the extra lock.
> >
> > This workload simulates a performance issue we are seeing
> > on deployed systems.
> > There are other ways for us to work
> > around the issue, not using xfs on those systems would be
> > one way, but I wanted to figure out the reason for this
> > behavior first.
>
> The workload is 8 threads doing random 8k reads and
> 8 threads doing random writes to a single 5GB file?

Yes. To a preallocated file.

> I'd be interested in seeing if a fio workload using 16 randrw
> threads demonstrates the same latency profile (i.e. threads aren't
> exclusively read- or write-only), or is the profile coming about
> specifically because you have dedicated write-only threads that will
> continually bash on the IO locks and hence starve readers?

I guess it depends on the rw mix. With a 50% r/w mix run on kernel
4.20.7-1.el7, long latencies are mostly on the write ops.

Gist of fio output inline and full output attached (sorry, for folks who
don't like scrolling down):

---XFS---
fio -filename=/xfsfolder/largefile1/00000001/00000001 -direct=0
-iodepth 1 -thread -rw=randrw -rwmixread=50 -ioengine=psync -bs=8k
-size=5G -numjobs=16 -runtime=60 -group_reporting -name=randrw_16th_8k
...
   read: IOPS=264, BW=2117KiB/s (2168kB/s)(124MiB/60045msec)
    clat (usec): min=148, max=491768, avg=39432.35, stdev=23039.80
...
  write: IOPS=268, BW=2149KiB/s (2201kB/s)(126MiB/60045msec)
    clat (usec): min=4, max=477942, avg=20673.33, stdev=26613.06

---EXT4---
fio -filename=/ext4folder/largefile1/00000001/00000001 -direct=0
-iodepth 1 -thread -rw=randrw -rwmixread=50 -ioengine=psync -bs=8k
-size=5G -numjobs=16 -runtime=60 -group_reporting -name=randrw_16th_8k
...
   read: IOPS=1409, BW=11.0MiB/s (11.5MB/s)(662MiB/60100msec)
    clat (usec): min=3, max=1488.6k, avg=11313.66, stdev=22312.95
...
  write: IOPS=1418, BW=11.1MiB/s (11.6MB/s)(666MiB/60100msec)
    clat (usec): min=4, max=7202, avg=19.68, stdev=40.91

> > My question is, is the purpose of this lock syncing
> > dio/buffered io?
>
> That's one part of it. The other is POSIX atomic write semantics.
>
> https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html
>
> "I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> Atomic means that all the bytes from a single operation that started
> out together end up together, without interleaving from other I/O
> operations."
>
> i.e. that independent read()s should see a write() as a single
> atomic change. Hence if you do a read() concurrently with a write(),
> the read should either run to completion before the write, or the
> write run to completion before the read().
>
> XFS is the only linux filesystem that provides this behaviour.
>
> ext4 and other filesystems that have no buffered read side locking
> provide serialisation at page level, which means an 8kB read racing
> with an 8kB write can return 4k from the new write and 4k from the
> yet-to-be-written part of the file. IOWs, the buffered read gets
> torn.

I see... Are you aware of any applications that depend on this POSIX
atomic write behavior? It would seem very unlikely that such applications
exist, if only xfs provides this guarantee on Linux...
IMO, it would be best if applications that desire atomic write semantics
would opt in for it using a preadv2 flag. Or, for backward compat, we can
allow applications to opt out of atomic write semantics.

[CC fsdevel]

> > If so, was making this behavior optional via mount option
> > ever considered for xfs?
> > Am I the first one who is asking about this specific workload
> > on xfs (couldn't find anything on Google), or is this a known
> > issue/trade off/design choice of xfs?
>
> It's an original design feature of XFS. See this 1996 usenix paper
> on XFS, section 6.2 - Performing File I/O, "Using Multiple
> Processes":
>
> "Currently, when using direct I/O and multiple writers, we
> place the burden of serializing writes to the same
> region of the file on the application.
> This differs from
> traditional Unix file I/O where file writes are atomic
> with respect to other file accesses, and it is one of the
> main reasons why we do not yet support multiple
> writers using traditional Unix file I/O."
>
> i.e. XFS was designed with the intent that buffered writes are
> atomic w.r.t. all other file accesses. That's not to say we can't
> change it, just that it has always been different to what linux
> native filesystems do. And depending on which set of application
> developers you talk to, you'll get different answers as to whether
> they want write()s to be atomic....

All right. I am reading the above as no prior objection to adding
an XFS mount option and/or preadv2() flag to opt out of this
behavior and align with the reckless behavior of the other local
filesystems on Linux.
Please correct me if I am wrong, and I would like to hear what
other people think w.r.t. mount option vs. preadv2() flag.
It could also be an open flag, but I prefer not to go down that road...

Thanks,
Amir.

[-- Attachment #2: fio.xfs.randrw_16th_8k.txt --]
[-- Type: text/plain, Size: 2687 bytes --]

randrw_16th_8k: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=psync, iodepth=1
...
fio-3.1
Starting 16 threads

randrw_16th_8k: (groupid=0, jobs=16): err= 0: pid=5320: Mon Mar 25 03:21:00 2019
   read: IOPS=264, BW=2117KiB/s (2168kB/s)(124MiB/60045msec)
    clat (usec): min=148, max=491768, avg=39432.35, stdev=23039.80
     lat (usec): min=148, max=491768, avg=39432.76, stdev=23039.81
    clat percentiles (msec):
     |  1.00th=[   12],  5.00th=[   18], 10.00th=[   21], 20.00th=[   26],
     | 30.00th=[   29], 40.00th=[   32], 50.00th=[   35], 60.00th=[   39],
     | 70.00th=[   43], 80.00th=[   50], 90.00th=[   64], 95.00th=[   80],
     | 99.00th=[  116], 99.50th=[  133], 99.90th=[  247], 99.95th=[  439],
     | 99.99th=[  485]
   bw (  KiB/s): min=   16, max=  272, per=6.27%, avg=132.82, stdev=48.78, samples=1913
   iops        : min=    2, max=   34, avg=16.60, stdev= 6.10, samples=1913
  write: IOPS=268, BW=2149KiB/s (2201kB/s)(126MiB/60045msec)
    clat (usec): min=4, max=477942, avg=20673.33, stdev=26613.06
     lat (usec): min=4, max=477952, avg=20673.98, stdev=26613.15
    clat percentiles (usec):
     |  1.00th=[     7],  5.00th=[     8], 10.00th=[     8], 20.00th=[    12],
     | 30.00th=[    22], 40.00th=[    31], 50.00th=[ 17957], 60.00th=[ 26346],
     | 70.00th=[ 31851], 80.00th=[ 38011], 90.00th=[ 50594], 95.00th=[ 65799],
     | 99.00th=[103285], 99.50th=[115868], 99.90th=[196084], 99.95th=[270533],
     | 99.99th=[476054]
   bw (  KiB/s): min=   16, max=  384, per=6.27%, avg=134.82, stdev=66.25, samples=1914
   iops        : min=    2, max=   48, avg=16.85, stdev= 8.28, samples=1914
  lat (usec)   : 10=8.31%, 20=6.12%, 50=9.41%, 100=0.68%, 250=0.02%
  lat (usec)   : 500=0.01%, 750=0.01%
  lat (msec)   : 2=0.01%, 4=0.07%, 10=0.27%, 20=5.26%, 50=55.06%
  lat (msec)   : 100=13.19%, 250=1.52%, 500=0.07%
  cpu          : usr=0.03%, sys=0.25%, ctx=41774, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=15893,16131,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=2117KiB/s (2168kB/s), 2117KiB/s-2117KiB/s (2168kB/s-2168kB/s), io=124MiB (130MB), run=60045-60045msec
  WRITE: bw=2149KiB/s (2201kB/s), 2149KiB/s-2149KiB/s (2201kB/s-2201kB/s), io=126MiB (132MB), run=60045-60045msec

Disk stats (read/write):
  sdc: ios=15435/10391, merge=0/1, ticks=122505/4115, in_queue=126193, util=98.26%

[-- Attachment #3: fio.ext4.randrw_16th_8k.txt --]
[-- Type: text/plain, Size: 2750 bytes --]

randrw_16th_8k: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=psync, iodepth=1
...
fio-3.1
Starting 16 threads

randrw_16th_8k: (groupid=0, jobs=16): err= 0: pid=5294: Mon Mar 25 03:15:49 2019
   read: IOPS=1409, BW=11.0MiB/s (11.5MB/s)(662MiB/60100msec)
    clat (usec): min=3, max=1488.6k, avg=11313.66, stdev=22312.95
     lat (usec): min=3, max=1488.6k, avg=11313.90, stdev=22312.95
    clat percentiles (usec):
     |  1.00th=[     5],  5.00th=[     6], 10.00th=[     8], 20.00th=[  3392],
     | 30.00th=[  5145], 40.00th=[  6718], 50.00th=[  8094], 60.00th=[  9503],
     | 70.00th=[ 11338], 80.00th=[ 14484], 90.00th=[ 21103], 95.00th=[ 29754],
     | 99.00th=[ 68682], 99.50th=[109577], 99.90th=[278922], 99.95th=[379585],
     | 99.99th=[893387]
   bw (  KiB/s): min=   32, max= 1202, per=6.30%, avg=710.12, stdev=188.47, samples=1909
   iops        : min=    4, max=  150, avg=88.72, stdev=23.54, samples=1909
  write: IOPS=1418, BW=11.1MiB/s (11.6MB/s)(666MiB/60100msec)
    clat (usec): min=4, max=7202, avg=19.68, stdev=40.91
     lat (usec): min=5, max=7203, avg=20.09, stdev=40.97
    clat percentiles (usec):
     |  1.00th=[    7],  5.00th=[   10], 10.00th=[   12], 20.00th=[   12],
     | 30.00th=[   13], 40.00th=[   15], 50.00th=[   18], 60.00th=[   20],
     | 70.00th=[   22], 80.00th=[   26], 90.00th=[   32], 95.00th=[   37],
     | 99.00th=[   56], 99.50th=[   64], 99.90th=[   91], 99.95th=[  117],
     | 99.99th=[  379]
   bw (  KiB/s): min=   16, max= 1523, per=6.30%, avg=715.24, stdev=247.64, samples=1908
   iops        : min=    2, max=  190, avg=89.36, stdev=30.93, samples=1908
  lat (usec)   : 4=0.02%, 10=8.31%, 20=29.16%, 50=17.64%, 100=0.69%
  lat (usec)   : 250=0.04%, 500=0.66%, 750=0.08%, 1000=0.02%
  lat (msec)   : 2=0.55%, 4=4.52%, 10=20.33%, 20=12.50%, 50=4.56%
  lat (msec)   : 100=0.63%, 250=0.21%, 500=0.05%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.09%, sys=0.48%, ctx=75050, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=84702,85250,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=11.0MiB/s (11.5MB/s), 11.0MiB/s-11.0MiB/s (11.5MB/s-11.5MB/s), io=662MiB (694MB), run=60100-60100msec
  WRITE: bw=11.1MiB/s (11.6MB/s), 11.1MiB/s-11.1MiB/s (11.6MB/s-11.6MB/s), io=666MiB (698MB), run=60100-60100msec

Disk stats (read/write):
  sdb: ios=74911/60641, merge=0/13, ticks=950347/34711, in_queue=987372, util=100.00%

^ permalink raw reply	[flat|nested] 16+ messages in thread
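[Editorial note] The page-level tearing Dave describes in the message above (an 8 KiB buffered read returning 4 KiB of new data and 4 KiB of stale data) can be probed from userspace with a small harness. This is a hedged sketch: function names, the block size, and iteration counts are invented for the example, the outcome is timing- and filesystem-dependent, and observing zero torn reads proves nothing about the filesystem's locking.

```python
import os
import threading

BLK = 8192  # one 8 KiB "I/O", matching the fio/filebench runs in this thread

def is_uniform(buf):
    """True if every byte in the buffer came from a single write() generation."""
    return len(set(buf)) <= 1

def count_torn_reads(path, iters=2000):
    """Hammer one 8 KiB block with alternating all-'A'/all-'B' writes while
    the calling thread checks each pread() for mixed content.

    On a filesystem that serialises buffered I/O only per page (e.g. ext4),
    a read *may* observe half 'A' and half 'B'; with a shared read-side
    i_rwsem (XFS) it should never tear. Returns the number of torn reads
    observed -- a result of 0 is not proof of atomicity.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    os.pwrite(fd, b"A" * BLK, 0)        # start from a uniform block
    stop = threading.Event()

    def writer():
        gen = 0
        while not stop.is_set():
            os.pwrite(fd, (b"A" if gen % 2 else b"B") * BLK, 0)
            gen += 1

    t = threading.Thread(target=writer)
    t.start()
    torn = 0
    for _ in range(iters):
        if not is_uniform(os.pread(fd, BLK, 0)):
            torn += 1
    stop.set()
    t.join()
    os.close(fd)
    return torn
```

Run against a file on each filesystem under test, a nonzero return is direct evidence of a torn buffered read in the POSIX sense quoted above.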
* Re: [QUESTION] Long read latencies on mixed rw buffered IO 2019-03-25 7:49 ` [QUESTION] Long read latencies on mixed rw buffered IO Amir Goldstein @ 2019-03-25 15:47 ` Darrick J. Wong 2019-03-25 16:41 ` Matthew Wilcox 2019-03-25 17:56 ` Amir Goldstein 2019-03-25 23:43 ` Dave Chinner 1 sibling, 2 replies; 16+ messages in thread From: Darrick J. Wong @ 2019-03-25 15:47 UTC (permalink / raw) To: Amir Goldstein; +Cc: Dave Chinner, linux-xfs, Christoph Hellwig, linux-fsdevel On Mon, Mar 25, 2019 at 09:49:33AM +0200, Amir Goldstein wrote: > On Mon, Mar 25, 2019 at 2:10 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Sun, Mar 24, 2019 at 08:18:10PM +0200, Amir Goldstein wrote: > > > Hi All, > > > > > > Christoph's re-factoring to xfs_ilock() brought up this question, > > > but AFAICS, current behavior seems to have always been that > > > way for xfs (?). > > > > > > Since commit 6552321831dc ("xfs: remove i_iolock and use > > > i_rwsem in the VFS inode instead"), xfs_file_buffered_aio_read() > > > is the only call sites I know of to call generic_file_read_iter() with > > > i_rwsem read side held. > > > > That did not change the locking behaviour of XFS at all. > > I know. Said so above. > > > > > > This lock is killing performance of multi-threaded buffered > > > read/write mixed workload on the same file [1]. > > > Attached output of bcc tools [2] script xfsdist and ext4dist > > > for latency distribution on the same mixed read/write workload. > > > > In future, can you just paste the text in line, rather than attach > > it as base64 encoded attechments? They are kinda hard to quote. > > > > ext4 read : > > > > .... > > 4 -> 7 : 2610 > > .... > > 2048 -> 4095 : 232 > > 4096 -> 8191 : 632 > > 8192 -> 16383 : 925 > > 16384 -> 32767 : 408 > > .... > > > > ext4 write: > > > > 4 -> 7 : 16732 > > 8 -> 15 : 81712 > > 16 -> 31 : 121450 > > 32 -> 63 : 29638 > > > > So about 4500 read IOPS, of which >50% are cache hits. 
> > And ~250,000 write iops, all in less than 64 microseconds so > > straight into cache. So we are looking at a ~50:1 write to read > > ratio here. > > > > XFS read: > > > > 0 -> 1 : 8 > > 2 -> 3 : 21 > > 4 -> 7 : 17 > > .... > > 8192 -> 16383 : 4 > > 16384 -> 32767 : 8 > > 32768 -> 65535 : 17 > > 65536 -> 131071 : 55 > > 131072 -> 262143 : 41 > > 262144 -> 524287 : 104 > > 524288 -> 1048575 : 101 > > 1048576 -> 2097151 : 34 > > .... > > > > XFS Write: > > 4 -> 7 : 10087 > > 8 -> 15 : 34815 > > 16 -> 31 : 32446 > > 32 -> 63 : 3888 > > 64 -> 127 : 155 > > .... > > <snip long tail> > > > > Which shows ~400 read ops (almost no cache hits) > > It's not because of almost no cache hits. > The pattern is the same as in ext4 test. > It's because most reads are blocked on writes in slow path, > as following sample stack stace from offwaketime bcc tool shows: > > Tracing blocked time (us) by user + kernel off-CPU and waker stack... > > waker: filebench 32155 > entry_SYSCALL_64_after_hwframe > do_syscall_64 > sys_pwrite64 > vfs_write > __vfs_write > new_sync_write > xfs_file_write_iter > xfs_file_buffered_aio_write > xfs_iunlock > up_write > call_rwsem_wake > rwsem_wake > -- -- > finish_task_switch > schedule > rwsem_down_read_failed > call_rwsem_down_read_failed > down_read > xfs_ilock > xfs_file_buffered_aio_read > xfs_file_read_iter > new_sync_read > __vfs_read > vfs_read > sys_pread64 > do_syscall_64 > entry_SYSCALL_64_after_hwframe > pread > target: filebench 32147 > 109242 > > > and ~80,000 write > > ops with a long latency tail. It's a roughly a 200:1 write to read > > ratio. > > > > I'm guessing what we are seeing here is a rwsem starvation problem. > > The only reason a read would get delayed for mor ethan a few hundred > > microseconds is ifthere are a continual stream of write holders > > starving pending reads. > > > > Yes. That seems to be the case. 
> > > I don't recall it being this bad historically but rwsems have been > > severely performance optimised for various corner cases over the > > past few years and it is extremely likely that behaviour has > > changed. Definitely needs looking at, but rwsems are so fragile > > these days (highly dependent on memory ordering that nobody seems to > > be able to get right) and impossible to validate as correct that we > > might just end up having to live with it. > > > > FWIW, What kernel did you test this on? > > The results above are from kernel 4.20.7-1.el7.elrepo.x86_64. > Your assumption about change of behavior of rwsem over the > years seems to be correct. > The original report we got from customer was with kernel 4.9.134 > In the original report, long latencies were on writes rather than on reads. > The following output of filebench randomrw workload summarizes the results: > > rand-write1 1128ops 19ops/s 0.1mb/s 340.3ms/op > [0.03ms - 7933.27ms] > rand-read1 2954684ops 49238ops/s 384.7mb/s 0.2ms/op > [0.02ms - 5873.27ms] > 61.026: IO Summary: 2955812 ops 49257.253 ops/s 49238/19 rd/wr > 384.8mb/s 0.3ms/op > > With more recent kernel (4.20), the latencies have shifted to the read ops: > > rand-write1 467706ops 7794ops/s 60.9mb/s 0.733ms/op > [0.004ms - 3899.167ms] > rand-read1 1970ops 33ops/s 0.3mb/s 164.277ms/op > [0.017ms - 3905.929ms] > 61.032: IO Summary: 469676 ops 7826.760 ops/s 33/7794 rd/wr 61.1mb/s 1.419ms/op > > In a workload of 100% reads or 100% writes, there are no extreme > latencies or neither > old nor new kernel. Hmmm.... so it looks like the rw_semaphore behavior has shifted over time, then? I thought rwsem was supposed to queue read and write waiters in order, at least on x86? Though I suppose that might not matter much since we can only run one writer at a time vs. waking up all the readers at once. 
Now I'm wondering if there ever was a time when the readers all got batched to the front and starved the writers, but eh I haven't drank enough coffee to remember things like that. :P (I wonder what would happen if rw_semaphore decided to wake up some number of the readers in the rwsem wait_list, not just the ones at the front...) > > > > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F > > > > Granted, I wasn't reported a problem is the traditional sense of "what is > the cause?" - that I know. I was asking why this change between xfs > and other fs. > I gathered a lot of traces and analysis from different tools and on > different kernels > during the investigation, so pouring them at all once would have been counter > productive. I'll happily provide any required analysis per demand. > > > > Compared to ext4, avg. read latency on RAID of spindles > > > can be two orders of magnitude higher (>100ms). > > > I can provide more performance numbers if needed with fio, > > > but they won't surprise anyone considering the extra lock. > > > > > > This workload simulates a performance issue we are seeing > > > on deployed systems. There are other ways for us to work > > > around the issue, not using xfs on those systems would be > > > one way, but I wanted to figure out the reason for this > > > behavior first. > > > > The workload is 8 threads doing random 8k random reads and > > 8 threads doing random writes to a single 5GB file? > > Yes. To a preallocated file. > > > > > I'd be interested in seeing if a fio workload using 16 randrw > > threads demonstrate the same latency profile (i.e. threads aren't > > exclusively read- or write- only), or is profile coming about > > specifically because you have dedicated write-only threads that will > > continually bash on the IO locks and hence starve readers? > > I guess it depends on the rw mix. 
With 50% r/w mix run on kernel > 4.20.7-1.el7, long latencies are mostly on the write ops. > > Gist of fio output inline and full output attached (Sorry, for folks who > don't like scrolling down): > > ---XFS--- > fio -filename=/xfsfolder/largefile1/00000001/00000001 -direct=0 > -iodepth 1 -thread -rw=randrw -rwmixread=50 -ioengine=psync -bs=8k > -size=5G -numjobs=16 -runtime=60 -group_reporting -name=randrw_16th_8k > ... > read: IOPS=264, BW=2117KiB/s (2168kB/s)(124MiB/60045msec) > clat (usec): min=148, max=491768, avg=39432.35, stdev=23039.80 > ... > write: IOPS=268, BW=2149KiB/s (2201kB/s)(126MiB/60045msec) > clat (usec): min=4, max=477942, avg=20673.33, stdev=26613.06 > > ---EXT4--- > fio -filename=/ext4folder/largefile1/00000001/00000001 -direct=0 > -iodepth 1 -thread -rw=randrw -rwmixread=50 -ioengine=psync -bs=8k > -size=5G -numjobs=16 -runtime=60 -group_reporting -name=randrw_16th_8k > ... > read: IOPS=1409, BW=11.0MiB/s (11.5MB/s)(662MiB/60100msec) > clat (usec): min=3, max=1488.6k, avg=11313.66, stdev=22312.95 > ... > write: IOPS=1418, BW=11.1MiB/s (11.6MB/s)(666MiB/60100msec) > clat (usec): min=4, max=7202, avg=19.68, stdev=40.91 > > > > > > My question is, is the purpose of this lock syncing > > > dio/buffered io? > > > > That's one part of it. The other is POSIX atomic write semantics. > > > > https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html > > > > "I/O is intended to be atomic to ordinary files and pipes and FIFOs. > > Atomic means that all the bytes from a single operation that started > > out together end up together, without interleaving from other I/O > > operations." > > > > i.e. that independent read()s should see a write() as a single > > atomic change. hence if you do a read() concurrently with a write(), > > the read should either run to completion before the write, or the > > write run to completion before the read(). > > > > XFS is the only linux filesystem that provides this behaviour. 
> > > > ext4 and other filesytsems tha thave no buffered read side locking > > provide serialisation at page level, which means an 8kB read racing > > with a 8kB write can return 4k from the new write and 4k from the > > yet-to-be-written part of the file. IOWs, the buffered read gets > > torn. > > > > I see... Are you aware of any applications that depend on this POSIX > atomic write behavior? It would seem very unlikely that such applications > exist, if only xfs provides this guaranty on Linux... If we remove the locks that establish the atomic io call behavior, it seems likely to me that we'll find out which applications actually did require the behavior and recommended XFS because of its unique abilities. Only now they'll be mad. > IMO, it would be best if applications that desire atomic writes semantics > would opt for it using a preadv2 flag. Or, for backward compat, we can allow > applications to opt out of atomic write semantics. [CC fsdevel] > > > > If so, was making this behavior optional via mount option > > > ever considered for xfs? > > > Am I the first one who is asking about this specific workload > > > on xfs (Couldn't found anything on Google), or is this a known > > > issue/trade off/design choice of xfs? > > > > It's an original design feature of XFS. See this 1996 usenix paper > > on XFS, section 6.2 - Performing File I/O , "Using Multiple > > Processes": > > > > "Currently, when using direct I/O and multiple writers, we > > place the burden of serializing writes to the same > > region of the file on the application. This differs from > > traditional Unix file I/O where file writes are atomic > > with respect to other file accesses, and it is one of the > > main reasons why we do not yet support multiple > > writers using traditional Unix file I/O." > > > > i.e. XFS was designed with the intent that buffered writes are > > atomic w.r.t. to all other file accesses. 
That's not to say we can't > > change it, just that it has always been different to what linux > > native filesystems do. And depending on which set of application > > developers you talk to, you'll get different answers as to whether > > they want write()s to be atomic.... > > > > All right. I am reading the above as no prior objection to adding > an XFS mount option and/or preadv2() flag to opt out of this > behavior Reads and writes are not the only thing xfs uses i_rwsem to synchronise. Reflink remap uses it to make sure everything's flushed to disk and that page cache contents remain clean while the remap is ongoing. I'm pretty sure pnfs uses it for similar reasons when granting and committing write leases. > and align with the reckless behavior of the other local > filesystems on Linux. > Please correct me if I am wrong and I would like to hear what > other people thing w.r.t mount option vs. preadv2() flag. > It could also be an open flag, but I prefer not to go down that road... I don't like the idea of adding a O_BROKENLOCKINGPONIES flag because it adds more QA work to support a new (for xfs) violation of the posix spec we (supposedly) fulfill. If rw_semaphore favors writes too heavily, why not fix that? <mutter mutter mutter> --D > Thanks, > Amir. > randrw_16th_8k: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=psync, iodepth=1 > ... 
> fio-3.1
> Starting 16 threads
>
> randrw_16th_8k: (groupid=0, jobs=16): err= 0: pid=5320: Mon Mar 25 03:21:00 2019
>   read: IOPS=264, BW=2117KiB/s (2168kB/s)(124MiB/60045msec)
>     clat (usec): min=148, max=491768, avg=39432.35, stdev=23039.80
>      lat (usec): min=148, max=491768, avg=39432.76, stdev=23039.81
>     clat percentiles (msec):
>      |  1.00th=[   12],  5.00th=[   18], 10.00th=[   21], 20.00th=[   26],
>      | 30.00th=[   29], 40.00th=[   32], 50.00th=[   35], 60.00th=[   39],
>      | 70.00th=[   43], 80.00th=[   50], 90.00th=[   64], 95.00th=[   80],
>      | 99.00th=[  116], 99.50th=[  133], 99.90th=[  247], 99.95th=[  439],
>      | 99.99th=[  485]
>    bw (  KiB/s): min=   16, max=  272, per=6.27%, avg=132.82, stdev=48.78, samples=1913
>    iops        : min=    2, max=   34, avg=16.60, stdev= 6.10, samples=1913
>   write: IOPS=268, BW=2149KiB/s (2201kB/s)(126MiB/60045msec)
>     clat (usec): min=4, max=477942, avg=20673.33, stdev=26613.06
>      lat (usec): min=4, max=477952, avg=20673.98, stdev=26613.15
>     clat percentiles (usec):
>      |  1.00th=[     7],  5.00th=[     8], 10.00th=[     8], 20.00th=[    12],
>      | 30.00th=[    22], 40.00th=[    31], 50.00th=[ 17957], 60.00th=[ 26346],
>      | 70.00th=[ 31851], 80.00th=[ 38011], 90.00th=[ 50594], 95.00th=[ 65799],
>      | 99.00th=[103285], 99.50th=[115868], 99.90th=[196084], 99.95th=[270533],
>      | 99.99th=[476054]
>    bw (  KiB/s): min=   16, max=  384, per=6.27%, avg=134.82, stdev=66.25, samples=1914
>    iops        : min=    2, max=   48, avg=16.85, stdev= 8.28, samples=1914
>   lat (usec)   : 10=8.31%, 20=6.12%, 50=9.41%, 100=0.68%, 250=0.02%
>   lat (usec)   : 500=0.01%, 750=0.01%
>   lat (msec)   : 2=0.01%, 4=0.07%, 10=0.27%, 20=5.26%, 50=55.06%
>   lat (msec)   : 100=13.19%, 250=1.52%, 500=0.07%
>   cpu          : usr=0.03%, sys=0.25%, ctx=41774, majf=0, minf=0
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwt: total=15893,16131,0, short=0,0,0, dropped=0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>    READ: bw=2117KiB/s (2168kB/s), 2117KiB/s-2117KiB/s (2168kB/s-2168kB/s), io=124MiB (130MB), run=60045-60045msec
>   WRITE: bw=2149KiB/s (2201kB/s), 2149KiB/s-2149KiB/s (2201kB/s-2201kB/s), io=126MiB (132MB), run=60045-60045msec
>
> Disk stats (read/write):
>   sdc: ios=15435/10391, merge=0/1, ticks=122505/4115, in_queue=126193, util=98.26%
>
> randrw_16th_8k: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=psync, iodepth=1
> ...
> fio-3.1
> Starting 16 threads
>
> randrw_16th_8k: (groupid=0, jobs=16): err= 0: pid=5294: Mon Mar 25 03:15:49 2019
>   read: IOPS=1409, BW=11.0MiB/s (11.5MB/s)(662MiB/60100msec)
>     clat (usec): min=3, max=1488.6k, avg=11313.66, stdev=22312.95
>      lat (usec): min=3, max=1488.6k, avg=11313.90, stdev=22312.95
>     clat percentiles (usec):
>      |  1.00th=[     5],  5.00th=[     6], 10.00th=[     8], 20.00th=[  3392],
>      | 30.00th=[  5145], 40.00th=[  6718], 50.00th=[  8094], 60.00th=[  9503],
>      | 70.00th=[ 11338], 80.00th=[ 14484], 90.00th=[ 21103], 95.00th=[ 29754],
>      | 99.00th=[ 68682], 99.50th=[109577], 99.90th=[278922], 99.95th=[379585],
>      | 99.99th=[893387]
>    bw (  KiB/s): min=   32, max= 1202, per=6.30%, avg=710.12, stdev=188.47, samples=1909
>    iops        : min=    4, max=  150, avg=88.72, stdev=23.54, samples=1909
>   write: IOPS=1418, BW=11.1MiB/s (11.6MB/s)(666MiB/60100msec)
>     clat (usec): min=4, max=7202, avg=19.68, stdev=40.91
>      lat (usec): min=5, max=7203, avg=20.09, stdev=40.97
>     clat percentiles (usec):
>      |  1.00th=[    7],  5.00th=[   10], 10.00th=[   12], 20.00th=[   12],
>      | 30.00th=[   13], 40.00th=[   15], 50.00th=[   18], 60.00th=[   20],
>      | 70.00th=[   22], 80.00th=[   26], 90.00th=[   32], 95.00th=[   37],
>      | 99.00th=[   56], 99.50th=[   64], 99.90th=[   91], 99.95th=[  117],
>      | 99.99th=[  379]
>    bw (  KiB/s): min=   16, max= 1523, per=6.30%, avg=715.24, stdev=247.64, samples=1908
>    iops        : min=    2, max=  190, avg=89.36, stdev=30.93, samples=1908
>   lat (usec)   : 4=0.02%, 10=8.31%, 20=29.16%, 50=17.64%, 100=0.69%
>   lat (usec)   : 250=0.04%, 500=0.66%, 750=0.08%, 1000=0.02%
>   lat (msec)   : 2=0.55%, 4=4.52%, 10=20.33%, 20=12.50%, 50=4.56%
>   lat (msec)   : 100=0.63%, 250=0.21%, 500=0.05%, 750=0.01%, 1000=0.01%
>   lat (msec)   : 2000=0.01%
>   cpu          : usr=0.09%, sys=0.48%, ctx=75050, majf=0, minf=0
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwt: total=84702,85250,0, short=0,0,0, dropped=0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>    READ: bw=11.0MiB/s (11.5MB/s), 11.0MiB/s-11.0MiB/s (11.5MB/s-11.5MB/s), io=662MiB (694MB), run=60100-60100msec
>   WRITE: bw=11.1MiB/s (11.6MB/s), 11.1MiB/s-11.1MiB/s (11.6MB/s-11.6MB/s), io=666MiB (698MB), run=60100-60100msec
>
> Disk stats (read/write):
>   sdb: ios=74911/60641, merge=0/13, ticks=950347/34711, in_queue=987372, util=100.00%

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 15:47 ` Darrick J. Wong
@ 2019-03-25 16:41   ` Matthew Wilcox
  2019-03-25 17:30     ` Amir Goldstein
  2019-03-25 17:56     ` Amir Goldstein
  1 sibling, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2019-03-25 16:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, Dave Chinner, linux-xfs, Christoph Hellwig, linux-fsdevel

On Mon, Mar 25, 2019 at 08:47:31AM -0700, Darrick J. Wong wrote:
> Hmmm.... so it looks like the rw_semaphore behavior has shifted over
> time, then?

Yes.

> I thought rwsem was supposed to queue read and write waiters in order,
> at least on x86? Though I suppose that might not matter much since we
> can only run one writer at a time vs. waking up all the readers at once.
> Now I'm wondering if there ever was a time when the readers all got
> batched to the front and starved the writers, but eh I haven't drank
> enough coffee to remember things like that. :P
>
> (I wonder what would happen if rw_semaphore decided to wake up some
> number of the readers in the rwsem wait_list, not just the ones at the
> front...)

rwsems currently allow a limited amount of queue-jumping; if a semaphore
is currently not acquired (it's in transition between two owners), a
running process can acquire it.

I think it is a bug that we only wake readers at the front of the queue;
I think we would get better performance if we wake all readers. ie here:

	/*
-	 * Grant an infinite number of read locks to the readers at the front
-	 * of the queue. We know that woken will be at least 1 as we accounted
+	 * Grant an infinite number of read locks. We know that woken will
+	 * be at least 1 as we accounted
	 * for above. Note we increment the 'active part' of the count by the
	 * number of readers before waking any processes up.
	 */
	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
		struct task_struct *tsk;

-		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-			break;

Amir, it seems like you have a good test-case for trying this out ...

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [QUESTION] Long read latencies on mixed rw buffered IO 2019-03-25 16:41 ` Matthew Wilcox @ 2019-03-25 17:30 ` Amir Goldstein 2019-03-25 18:22 ` Matthew Wilcox 0 siblings, 1 reply; 16+ messages in thread From: Amir Goldstein @ 2019-03-25 17:30 UTC (permalink / raw) To: Matthew Wilcox Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Christoph Hellwig, linux-fsdevel On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Mar 25, 2019 at 08:47:31AM -0700, Darrick J. Wong wrote: > > Hmmm.... so it looks like the rw_semaphore behavior has shifted over > > time, then? > > Yes. > > > I thought rwsem was supposed to queue read and write waiters in order, > > at least on x86? Though I suppose that might not matter much since we > > can only run one writer at a time vs. waking up all the readers at once. > > Now I'm wondering if there ever was a time when the readers all got > > batched to the front and starved the writers, but eh I haven't drank > > enough coffee to remember things like that. :P > > > > (I wonder what would happen if rw_semaphore decided to wake up some > > number of the readers in the rwsem wait_list, not just the ones at the > > front...) > > rwsems currently allow a limited amount of queue-jumping; if a semaphore > is currently not acquired (it's in transition between two owners), a > running process can acquire it. > > I think it is a bug that we only wake readers at the front of the queue; > I think we would get better performance if we wake all readers. ie here: > > /* > - * Grant an infinite number of read locks to the readers at the front > - * of the queue. We know that woken will be at least 1 as we accounted > + * Grant an infinite number of read locks. We know that woken will > + * be at least 1 as we accounted > * for above. Note we increment the 'active part' of the count by the > * number of readers before waking any processes up. 
> */ > list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) { > struct task_struct *tsk; > > - if (waiter->type == RWSEM_WAITING_FOR_WRITE) > - break; > > Amir, it seems like you have a good test-case for trying this out ... Sure, but please explain. Why are you waking up the writers? Thanks, Amir. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [QUESTION] Long read latencies on mixed rw buffered IO 2019-03-25 17:30 ` Amir Goldstein @ 2019-03-25 18:22 ` Matthew Wilcox 2019-03-25 19:18 ` Amir Goldstein 0 siblings, 1 reply; 16+ messages in thread From: Matthew Wilcox @ 2019-03-25 18:22 UTC (permalink / raw) To: Amir Goldstein Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Christoph Hellwig, linux-fsdevel On Mon, Mar 25, 2019 at 07:30:39PM +0200, Amir Goldstein wrote: > On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Mon, Mar 25, 2019 at 08:47:31AM -0700, Darrick J. Wong wrote: > > > Hmmm.... so it looks like the rw_semaphore behavior has shifted over > > > time, then? > > > > Yes. > > > > > I thought rwsem was supposed to queue read and write waiters in order, > > > at least on x86? Though I suppose that might not matter much since we > > > can only run one writer at a time vs. waking up all the readers at once. > > > Now I'm wondering if there ever was a time when the readers all got > > > batched to the front and starved the writers, but eh I haven't drank > > > enough coffee to remember things like that. :P > > > > > > (I wonder what would happen if rw_semaphore decided to wake up some > > > number of the readers in the rwsem wait_list, not just the ones at the > > > front...) > > > > rwsems currently allow a limited amount of queue-jumping; if a semaphore > > is currently not acquired (it's in transition between two owners), a > > running process can acquire it. > > > > I think it is a bug that we only wake readers at the front of the queue; > > I think we would get better performance if we wake all readers. ie here: > > > > /* > > - * Grant an infinite number of read locks to the readers at the front > > - * of the queue. We know that woken will be at least 1 as we accounted > > + * Grant an infinite number of read locks. We know that woken will > > + * be at least 1 as we accounted > > * for above. 
Note we increment the 'active part' of the count by the
> > * number of readers before waking any processes up.
> > */
> > list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
> > 	struct task_struct *tsk;
> >
> > -	if (waiter->type == RWSEM_WAITING_FOR_WRITE)
> > -		break;
> >
> > Amir, it seems like you have a good test-case for trying this out ...
>
> Sure, but please explain. Why are you waking up the writers?

Oh ... duh. I didn't mean to wake up the writers. I meant ...

	if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-		break;
+		continue;

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [QUESTION] Long read latencies on mixed rw buffered IO 2019-03-25 18:22 ` Matthew Wilcox @ 2019-03-25 19:18 ` Amir Goldstein 2019-03-25 19:40 ` Matthew Wilcox 0 siblings, 1 reply; 16+ messages in thread From: Amir Goldstein @ 2019-03-25 19:18 UTC (permalink / raw) To: Matthew Wilcox Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Christoph Hellwig, linux-fsdevel On Mon, Mar 25, 2019 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Mar 25, 2019 at 07:30:39PM +0200, Amir Goldstein wrote: > > On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Mon, Mar 25, 2019 at 08:47:31AM -0700, Darrick J. Wong wrote: > > > > Hmmm.... so it looks like the rw_semaphore behavior has shifted over > > > > time, then? > > > > > > Yes. > > > > > > > I thought rwsem was supposed to queue read and write waiters in order, > > > > at least on x86? Though I suppose that might not matter much since we > > > > can only run one writer at a time vs. waking up all the readers at once. > > > > Now I'm wondering if there ever was a time when the readers all got > > > > batched to the front and starved the writers, but eh I haven't drank > > > > enough coffee to remember things like that. :P > > > > > > > > (I wonder what would happen if rw_semaphore decided to wake up some > > > > number of the readers in the rwsem wait_list, not just the ones at the > > > > front...) > > > > > > rwsems currently allow a limited amount of queue-jumping; if a semaphore > > > is currently not acquired (it's in transition between two owners), a > > > running process can acquire it. > > > > > > I think it is a bug that we only wake readers at the front of the queue; > > > I think we would get better performance if we wake all readers. ie here: > > > > > > /* > > > - * Grant an infinite number of read locks to the readers at the front > > > - * of the queue. We know that woken will be at least 1 as we accounted > > > + * Grant an infinite number of read locks. 
We know that woken will
> > > + * be at least 1 as we accounted
> > > * for above. Note we increment the 'active part' of the count by the
> > > * number of readers before waking any processes up.
> > > */
> > > list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
> > > 	struct task_struct *tsk;
> > >
> > > -	if (waiter->type == RWSEM_WAITING_FOR_WRITE)
> > > -		break;
> > >
> > > Amir, it seems like you have a good test-case for trying this out ...
> >
> > Sure, but please explain. Why are you waking up the writers?
>
> Oh ... duh. I didn't mean to wake up the writers. I meant ...
>
> 	if (waiter->type == RWSEM_WAITING_FOR_WRITE)
> -		break;
> +		continue;
>

So I have no access to the test machine of former tests right now,
but when running the same filebench randomrw workload
(8 writers, 8 readers) on VM with 2 CPUs and SSD drive, results
are not looking good for this patch:

--- v5.1-rc1 / xfs ---
rand-write1          852404ops    14202ops/s   110.9mb/s      0.6ms/op [0.01ms - 553.45ms]
rand-read1            26117ops      435ops/s     3.4mb/s     18.4ms/op [0.04ms - 632.29ms]
61.088: IO Summary: 878521 ops 14636.774 ops/s 435/14202 rd/wr 114.3mb/s 1.1ms/op

--- v5.1-rc1 / xfs + patch above ---
rand-write1         1117998ops    18621ops/s   145.5mb/s      0.4ms/op [0.01ms - 788.19ms]
rand-read1             7089ops      118ops/s     0.9mb/s     67.4ms/op [0.03ms - 792.67ms]
61.091: IO Summary: 1125087 ops 18738.961 ops/s 118/18621 rd/wr 146.4mb/s 0.8ms/op

--- v5.1-rc1 / xfs + remove XFS_IOLOCK_SHARED from xfs_file_buffered_aio_read ---
rand-write1         1025826ops    17091ops/s   133.5mb/s      0.5ms/op [0.01ms - 909.20ms]
rand-read1           115162ops     1919ops/s    15.0mb/s      4.2ms/op [0.00ms - 157.46ms]
61.084: IO Summary: 1140988 ops 19009.369 ops/s 1919/17091 rd/wr 148.5mb/s 0.8ms/op

--- v5.1-rc1 / ext4 ---
rand-write1          867926ops    14459ops/s   113.0mb/s      0.6ms/op [0.01ms - 886.89ms]
rand-read1           121893ops     2031ops/s    15.9mb/s      3.9ms/op [0.00ms - 149.24ms]
61.102: IO Summary: 989819 ops 16489.132 ops/s 2031/14459 rd/wr 128.8mb/s 1.0ms/op

So rw_semaphore fix is not in the ballpark, not even looking in the
right direction...

Any other ideas to try?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [QUESTION] Long read latencies on mixed rw buffered IO 2019-03-25 19:18 ` Amir Goldstein @ 2019-03-25 19:40 ` Matthew Wilcox 2019-03-25 19:57 ` Amir Goldstein 0 siblings, 1 reply; 16+ messages in thread From: Matthew Wilcox @ 2019-03-25 19:40 UTC (permalink / raw) To: Amir Goldstein Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Christoph Hellwig, linux-fsdevel On Mon, Mar 25, 2019 at 09:18:51PM +0200, Amir Goldstein wrote: > On Mon, Mar 25, 2019 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Mar 25, 2019 at 07:30:39PM +0200, Amir Goldstein wrote: > > > On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > I think it is a bug that we only wake readers at the front of the queue; > > > > I think we would get better performance if we wake all readers. ie here: > > So I have no access to the test machine of former tests right now, > but when running the same filebench randomrw workload > (8 writers, 8 readers) on VM with 2 CPUs and SSD drive, results > are not looking good for this patch: > > --- v5.1-rc1 / xfs --- > rand-write1 852404ops 14202ops/s 110.9mb/s 0.6ms/op > [0.01ms - 553.45ms] > rand-read1 26117ops 435ops/s 3.4mb/s 18.4ms/op > [0.04ms - 632.29ms] > 61.088: IO Summary: 878521 ops 14636.774 ops/s 435/14202 rd/wr > 114.3mb/s 1.1ms/op > > --- v5.1-rc1 / xfs + patch above --- > rand-write1 1117998ops 18621ops/s 145.5mb/s 0.4ms/op > [0.01ms - 788.19ms] > rand-read1 7089ops 118ops/s 0.9mb/s 67.4ms/op > [0.03ms - 792.67ms] > 61.091: IO Summary: 1125087 ops 18738.961 ops/s 118/18621 rd/wr > 146.4mb/s 0.8ms/op > > --- v5.1-rc1 / xfs + remove XFS_IOLOCK_SHARED from > xfs_file_buffered_aio_read --- > rand-write1 1025826ops 17091ops/s 133.5mb/s 0.5ms/op > [0.01ms - 909.20ms] > rand-read1 115162ops 1919ops/s 15.0mb/s 4.2ms/op > [0.00ms - 157.46ms] > 61.084: IO Summary: 1140988 ops 19009.369 ops/s 1919/17091 rd/wr > 148.5mb/s 0.8ms/op > > --- v5.1-rc1 / ext4 --- > rand-write1 867926ops 14459ops/s 113.0mb/s 0.6ms/op > 
[0.01ms - 886.89ms]
> rand-read1 121893ops 2031ops/s 15.9mb/s 3.9ms/op
> [0.00ms - 149.24ms]
> 61.102: IO Summary: 989819 ops 16489.132 ops/s 2031/14459 rd/wr
> 128.8mb/s 1.0ms/op
>
> So rw_semaphore fix is not in the ballpark, not even looking in the
> right direction...
>
> Any other ideas to try?

Sure! Maybe the problem is walking the list over and over. So add new
readers to the front of the list if the head of the list is a reader;
otherwise add them to the tail of the list.

(this won't have quite the same effect as the previous patch because
new readers coming in while the head of the list is a writer will still
get jumbled with new writers, but it should be better than we have now,
assuming the problem is that readers are being delayed behind writers).

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index fbe96341beee..56dbbaea90ee 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -250,8 +250,15 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 			return sem;
 		}
 		adjustment += RWSEM_WAITING_BIAS;
+		list_add_tail(&waiter.list, &sem->wait_list);
+	} else {
+		struct rwsem_waiter *first = list_first_entry(&sem->wait_list,
+					typeof(*first), list);
+		if (first->type == RWSEM_WAITING_FOR_READ)
+			list_add(&waiter.list, &sem->wait_list);
+		else
+			list_add_tail(&waiter.list, &sem->wait_list);
 	}
-	list_add_tail(&waiter.list, &sem->wait_list);

 	/* we're now waiting on the lock, but no longer actively locking */
 	count = atomic_long_add_return(adjustment, &sem->count);

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 19:40 ` Matthew Wilcox
@ 2019-03-25 19:57   ` Amir Goldstein
  2019-03-25 23:48     ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Amir Goldstein @ 2019-03-25 19:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Christoph Hellwig, linux-fsdevel

On Mon, Mar 25, 2019 at 9:40 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Mar 25, 2019 at 09:18:51PM +0200, Amir Goldstein wrote:
> > On Mon, Mar 25, 2019 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > On Mon, Mar 25, 2019 at 07:30:39PM +0200, Amir Goldstein wrote:
> > > > On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > I think it is a bug that we only wake readers at the front of the queue;
> > > > > I think we would get better performance if we wake all readers. ie here:
> >
> > So I have no access to the test machine of former tests right now,
> > but when running the same filebench randomrw workload
> > (8 writers, 8 readers) on VM with 2 CPUs and SSD drive, results
> > are not looking good for this patch:
> >
> > --- v5.1-rc1 / xfs ---
> > rand-write1          852404ops    14202ops/s   110.9mb/s      0.6ms/op
> > [0.01ms - 553.45ms]
> > rand-read1            26117ops      435ops/s     3.4mb/s     18.4ms/op
> > [0.04ms - 632.29ms]
> > 61.088: IO Summary: 878521 ops 14636.774 ops/s 435/14202 rd/wr
> > 114.3mb/s 1.1ms/op
> >

--- v5.1-rc1 / xfs + patch v2 below ---
rand-write1          852487ops    14175ops/s   110.7mb/s      0.6ms/op [0.01ms - 755.24ms]
rand-read1            23194ops      386ops/s     3.0mb/s     20.7ms/op [0.03ms - 755.25ms]
61.187: IO Summary: 875681 ops 14560.980 ops/s 386/14175 rd/wr 113.8mb/s 1.1ms/op

Not as bad as v1. Only a little bit worse than master...
The whole deal with the read/write balance and on SSD, I imagine
the balance really changes. That's why I was skeptical about
one-size-fits-all read/write balance.

Keeping an open mind. Please throw more patches at me.
I will also test them on machine with spindles tomorrow. Thanks, Amir. > > --- v5.1-rc1 / xfs + patch above --- > > rand-write1 1117998ops 18621ops/s 145.5mb/s 0.4ms/op > > [0.01ms - 788.19ms] > > rand-read1 7089ops 118ops/s 0.9mb/s 67.4ms/op > > [0.03ms - 792.67ms] > > 61.091: IO Summary: 1125087 ops 18738.961 ops/s 118/18621 rd/wr > > 146.4mb/s 0.8ms/op > > > > --- v5.1-rc1 / xfs + remove XFS_IOLOCK_SHARED from > > xfs_file_buffered_aio_read --- > > rand-write1 1025826ops 17091ops/s 133.5mb/s 0.5ms/op > > [0.01ms - 909.20ms] > > rand-read1 115162ops 1919ops/s 15.0mb/s 4.2ms/op > > [0.00ms - 157.46ms] > > 61.084: IO Summary: 1140988 ops 19009.369 ops/s 1919/17091 rd/wr > > 148.5mb/s 0.8ms/op > > > > --- v5.1-rc1 / ext4 --- > > rand-write1 867926ops 14459ops/s 113.0mb/s 0.6ms/op > > [0.01ms - 886.89ms] > > rand-read1 121893ops 2031ops/s 15.9mb/s 3.9ms/op > > [0.00ms - 149.24ms] > > 61.102: IO Summary: 989819 ops 16489.132 ops/s 2031/14459 rd/wr > > 128.8mb/s 1.0ms/op > > > > So rw_semaphore fix is not in the ballpark, not even looking in the > > right direction... > > > > Any other ideas to try? > > Sure! Maybe the problem is walking the list over and over. So add new > readers to the front of the list if the head of the list is a reader; > otherwise add them to the tail of the list. > > (this won't have quite the same effect as the previous patch because > new readers coming in while the head of the list is a writer will still > get jumbled with new writers, but it should be better than we have now, > assuming the problem is that readers are being delayed behind writers). 
>
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index fbe96341beee..56dbbaea90ee 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -250,8 +250,15 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
>  			return sem;
>  		}
>  		adjustment += RWSEM_WAITING_BIAS;
> +		list_add_tail(&waiter.list, &sem->wait_list);
> +	} else {
> +		struct rwsem_waiter *first = list_first_entry(&sem->wait_list,
> +					typeof(*first), list);
> +		if (first->type == RWSEM_WAITING_FOR_READ)
> +			list_add(&waiter.list, &sem->wait_list);
> +		else
> +			list_add_tail(&waiter.list, &sem->wait_list);
>  	}
> -	list_add_tail(&waiter.list, &sem->wait_list);
>
>  	/* we're now waiting on the lock, but no longer actively locking */
>  	count = atomic_long_add_return(adjustment, &sem->count);

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [QUESTION] Long read latencies on mixed rw buffered IO 2019-03-25 19:57 ` Amir Goldstein @ 2019-03-25 23:48 ` Dave Chinner 2019-03-26 3:44 ` Amir Goldstein 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2019-03-25 23:48 UTC (permalink / raw) To: Amir Goldstein Cc: Matthew Wilcox, Darrick J. Wong, linux-xfs, Christoph Hellwig, linux-fsdevel On Mon, Mar 25, 2019 at 09:57:46PM +0200, Amir Goldstein wrote: > On Mon, Mar 25, 2019 at 9:40 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Mon, Mar 25, 2019 at 09:18:51PM +0200, Amir Goldstein wrote: > > > On Mon, Mar 25, 2019 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Mon, Mar 25, 2019 at 07:30:39PM +0200, Amir Goldstein wrote: > > > > > On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > I think it is a bug that we only wake readers at the front of the queue; > > > > > > I think we would get better performance if we wake all readers. ie here: > > > > > > So I have no access to the test machine of former tests right now, > > > but when running the same filebench randomrw workload > > > (8 writers, 8 readers) on VM with 2 CPUs and SSD drive, results > > > are not looking good for this patch: > > > > > > --- v5.1-rc1 / xfs --- > > > rand-write1 852404ops 14202ops/s 110.9mb/s 0.6ms/op > > > [0.01ms - 553.45ms] > > > rand-read1 26117ops 435ops/s 3.4mb/s 18.4ms/op > > > [0.04ms - 632.29ms] > > > 61.088: IO Summary: 878521 ops 14636.774 ops/s 435/14202 rd/wr > > > 114.3mb/s 1.1ms/op > > > > > --- v5.1-rc1 / xfs + patch v2 below --- > rand-write1 852487ops 14175ops/s 110.7mb/s 0.6ms/op > [0.01ms - 755.24ms] > rand-read1 23194ops 386ops/s 3.0mb/s 20.7ms/op > [0.03ms - 755.25ms] > 61.187: IO Summary: 875681 ops 14560.980 ops/s 386/14175 rd/wr > 113.8mb/s 1.1ms/op > > Not as bad as v1. Only a little bit worse than master... > The whole deal with the read/write balance and on SSD, I imagine > the balance really changes. 
That's why I was skeptical about > one-size-fits all read/write balance. You're not testing your SSD. You're testing writes into cache vs reads from disk. There is a massive latency difference in the two operations, so unless you use O_DSYNC for the writes, you are going to see this cache-vs-uncached performance unbalance. i.e. unless the rwsem is truly fair, there is always going to be more writer access to the lock because they spend less time holding it and so can put much more pressure on it. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [QUESTION] Long read latencies on mixed rw buffered IO 2019-03-25 23:48 ` Dave Chinner @ 2019-03-26 3:44 ` Amir Goldstein 2019-03-27 1:29 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Amir Goldstein @ 2019-03-26 3:44 UTC (permalink / raw) To: Dave Chinner Cc: Matthew Wilcox, Darrick J. Wong, linux-xfs, Christoph Hellwig, linux-fsdevel On Tue, Mar 26, 2019 at 1:48 AM Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Mar 25, 2019 at 09:57:46PM +0200, Amir Goldstein wrote: > > On Mon, Mar 25, 2019 at 9:40 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Mon, Mar 25, 2019 at 09:18:51PM +0200, Amir Goldstein wrote: > > > > On Mon, Mar 25, 2019 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > On Mon, Mar 25, 2019 at 07:30:39PM +0200, Amir Goldstein wrote: > > > > > > On Mon, Mar 25, 2019 at 6:41 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > I think it is a bug that we only wake readers at the front of the queue; > > > > > > > I think we would get better performance if we wake all readers. ie here: > > > > > > > > So I have no access to the test machine of former tests right now, > > > > but when running the same filebench randomrw workload > > > > (8 writers, 8 readers) on VM with 2 CPUs and SSD drive, results > > > > are not looking good for this patch: > > > > > > > > --- v5.1-rc1 / xfs --- > > > > rand-write1 852404ops 14202ops/s 110.9mb/s 0.6ms/op > > > > [0.01ms - 553.45ms] > > > > rand-read1 26117ops 435ops/s 3.4mb/s 18.4ms/op > > > > [0.04ms - 632.29ms] > > > > 61.088: IO Summary: 878521 ops 14636.774 ops/s 435/14202 rd/wr > > > > 114.3mb/s 1.1ms/op > > > > > > > > --- v5.1-rc1 / xfs + patch v2 below --- > > rand-write1 852487ops 14175ops/s 110.7mb/s 0.6ms/op > > [0.01ms - 755.24ms] > > rand-read1 23194ops 386ops/s 3.0mb/s 20.7ms/op > > [0.03ms - 755.25ms] > > 61.187: IO Summary: 875681 ops 14560.980 ops/s 386/14175 rd/wr > > 113.8mb/s 1.1ms/op > > > > Not as bad as v1. 
Only a little bit worse than master...
> > The whole deal with the read/write balance and on SSD, I imagine
> > the balance really changes. That's why I was skeptical about
> > one-size-fits all read/write balance.
>
> You're not testing your SSD. You're testing writes into cache vs
> reads from disk. There is a massive latency difference in the two
> operations, so unless you use O_DSYNC for the writes, you are going
> to see this cache-vs-uncached performance unbalance. i.e. unless the
> rwsem is truly fair, there is always going to be more writer
> access to the lock because they spend less time holding it and so
> can put much more pressure on it.
>

Yeah, I know. SSD makes the balance better because of faster reads
from disk. Was pointing out that the worst case I am interested in is
on spindles. That said, O_DSYNC certainly does improve the balance
and gives shorter worst-case latencies. However, it does not make the
problem go away. Holding i_rwsem (even for 4K reads) takes its toll
on write latencies (compared to ext4).

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [QUESTION] Long read latencies on mixed rw buffered IO 2019-03-26 3:44 ` Amir Goldstein @ 2019-03-27 1:29 ` Dave Chinner 0 siblings, 0 replies; 16+ messages in thread From: Dave Chinner @ 2019-03-27 1:29 UTC (permalink / raw) To: Amir Goldstein Cc: Matthew Wilcox, Darrick J. Wong, linux-xfs, Christoph Hellwig, linux-fsdevel On Tue, Mar 26, 2019 at 05:44:34AM +0200, Amir Goldstein wrote: > On Tue, Mar 26, 2019 at 1:48 AM Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Mar 25, 2019 at 09:57:46PM +0200, Amir Goldstein wrote: > > > Not as bad as v1. Only a little bit worse than master... > > > The whole deal with the read/write balance and on SSD, I imagine > > > the balance really changes. That's why I was skeptical about > > > one-size-fits all read/write balance. > > > > You're not testing your SSD. You're testing writes into cache vs > > reads from disk. There is a massive latency difference in the two > > operations, so unless you use O_DSYNC for the writes, you are going > > to see this cache-vs-uncached performance unbalance. i.e. unless the > > rwsem is truly fair, there is always going to be more writer > > access to the lock because they spend less time holding it and so > > can put much more pressure on it. > > > > Yeh, I know. SSD makes the balance better because of faster reads > from disk. Was pointing out that the worse case I am interested in is > on spindles. That said, O_DSYNC certainly does improve the balance > and gives shorter worse case latencies. However, it does not make the > problem go away. i_rwsem taken (even for 4K reads) takes its toll > on write latencies (compared to ext4). Sure, that's because it's serialising access across the entire file, not just the page that is being read/written to. The solution here is to move to concurrent buffered writes similar to how we do direct IO. We need to track IO ranges for that, though, so we can serialise unaligned and/or overlapping IOs. 
We need fast, efficient range locks for this, though. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [QUESTION] Long read latencies on mixed rw buffered IO 2019-03-25 15:47 ` Darrick J. Wong 2019-03-25 16:41 ` Matthew Wilcox @ 2019-03-25 17:56 ` Amir Goldstein 2019-03-25 18:02 ` Christoph Hellwig 1 sibling, 1 reply; 16+ messages in thread From: Amir Goldstein @ 2019-03-25 17:56 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, Christoph Hellwig, linux-fsdevel [...] > > > i.e. XFS was designed with the intent that buffered writes are > > > atomic w.r.t. to all other file accesses. That's not to say we can't > > > change it, just that it has always been different to what linux > > > native filesystems do. And depending on which set of application > > > developers you talk to, you'll get different answers as to whether > > > they want write()s to be atomic.... > > > > > > > All right. I am reading the above as no prior objection to adding > > an XFS mount option and/or preadv2() flag to opt out of this > > behavior > > Reads and writes are not the only thing xfs uses i_rwsem to synchronise. > Reflink remap uses it to make sure everything's flushed to disk and that > page cache contents remain clean while the remap is ongoing. I'm pretty > sure pnfs uses it for similar reasons when granting and committing write > leases. > OK. If it wasn't clear, I wasn't suggesting to remove i_rwsem around buffered writes. All fs must take exclusive i_rwsem around write_begin() and write_end() and for fs that call generic_file_write_iter() i_rwsem is held throughout the buffered write. What I said is that no other fs holds a shared i_rwsem around generic_file_read_iter(), so this shared lock could perhaps be removed without implications beyond atomic rd/wr and sync with dio. FWIW, I ran a sanity -g quick run with 1k and 4k block sizes without the shared i_rwsem lock around generic_file_read_iter() and it did not barf. > > and align with the reckless behavior of the other local > > filesystems on Linux. 
> > Please correct me if I am wrong and I would like to hear what
> other people think w.r.t. mount option vs. preadv2() flag.
>
> It could also be an open flag, but I prefer not to go down that road...
>
> I don't like the idea of adding a O_BROKENLOCKINGPONIES flag because it
> adds more QA work to support a new (for xfs) violation of the posix spec
> we (supposedly) fulfill.
>

Surely, there is no reason for you to like a larger test matrix, but
if it turns up that there is no magic fix, I don't see an alternative.
Perhaps application opt-in via RWF_NOATOMIC would have the least
impact on the test matrix. It makes sense for applications that know
they handle IO concurrency at a higher level, or applications that
know they only need page-size atomicity.

> If rw_semaphore favors writes too heavily, why not fix that?
>

Sure, let's give that a shot. But allow me to stay skeptical, because
I don't think there is a one-size-fits-all solution.
If an application doesn't need >4K atomicity and xfs imposes file-wide
read locks, there is bound to exist a workload where ext4 can guarantee
lower latencies than xfs.

Then again, if we fix rw_semaphore to do a good enough job for my
workload, I may not care about those worst-case workloads...

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [QUESTION] Long read latencies on mixed rw buffered IO 2019-03-25 17:56 ` Amir Goldstein @ 2019-03-25 18:02 ` Christoph Hellwig 2019-03-25 18:44 ` Amir Goldstein 0 siblings, 1 reply; 16+ messages in thread From: Christoph Hellwig @ 2019-03-25 18:02 UTC (permalink / raw) To: Amir Goldstein Cc: Darrick J. Wong, Dave Chinner, linux-xfs, Christoph Hellwig, linux-fsdevel On Mon, Mar 25, 2019 at 07:56:33PM +0200, Amir Goldstein wrote: > Sure, let's give that a shot. But allow me to stay skeptical, because > I don't think there is a one-size-fits-all solution. > If application doesn't need >4K atomicity and xfs imposes file-wide > read locks, there is bound to exist a workload where ext4 can guaranty > lower latencies than xfs. > > Then again, if we fix rw_semaphore to do a good enough job for my > workload, I may not care about those worst case workloads... Downgrading these long standing guarantees is simply not an option. Not quite sure what the I/O pattern of your workload is, but if it is reads from other regions than you write to you should look into implementing range locks. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 18:02 ` Christoph Hellwig
@ 2019-03-25 18:44   ` Amir Goldstein
  0 siblings, 0 replies; 16+ messages in thread
From: Amir Goldstein @ 2019-03-25 18:44 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs, linux-fsdevel

On Mon, Mar 25, 2019 at 8:02 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Mar 25, 2019 at 07:56:33PM +0200, Amir Goldstein wrote:
> > Sure, let's give that a shot. But allow me to stay skeptical, because
> > I don't think there is a one-size-fits-all solution.
> > If application doesn't need >4K atomicity and xfs imposes file-wide
> > read locks, there is bound to exist a workload where ext4 can guaranty
> > lower latencies than xfs.
> >
> > Then again, if we fix rw_semaphore to do a good enough job for my
> > workload, I may not care about those worst case workloads...
>
> Downgrading these long standing guarantees is simply not an option.

Right. Not without user opt-in.

> Not quite sure what the I/O pattern of your workload is, but if it
> consists of reads from regions other than the ones you write to, you
> should look into implementing range locks.

The workload is a VM image file (VDI) accessed with a random IO pattern
over SMB. I am still studying the exact rd/wr mix patterns that really
matter to customers.

Thanks,
Amir.
* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25  7:49 ` [QUESTION] Long read latencies on mixed rw buffered IO Amir Goldstein
  2019-03-25 15:47   ` Darrick J. Wong
@ 2019-03-25 23:43   ` Dave Chinner
  2019-03-26  4:36     ` Amir Goldstein
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2019-03-25 23:43 UTC (permalink / raw)
To: Amir Goldstein; +Cc: linux-xfs, Christoph Hellwig, linux-fsdevel

On Mon, Mar 25, 2019 at 09:49:33AM +0200, Amir Goldstein wrote:
> On Mon, Mar 25, 2019 at 2:10 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sun, Mar 24, 2019 at 08:18:10PM +0200, Amir Goldstein wrote:
> > > Hi All,
> > >
> > > Christoph's re-factoring to xfs_ilock() brought up this question,
> > > but AFAICS, current behavior seems to have always been that
> > > way for xfs (?).
> > >
> > > Since commit 6552321831dc ("xfs: remove i_iolock and use
> > > i_rwsem in the VFS inode instead"), xfs_file_buffered_aio_read()
> > > is the only call sites I know of to call generic_file_read_iter() with
> > > i_rwsem read side held.
> >
> > That did not change the locking behaviour of XFS at all.
>
> I know. Said so above.

Followed by a (?), which seems to indicate you were unsure and
seeking clarification.

> > I'm guessing what we are seeing here is a rwsem starvation problem.
> > The only reason a read would get delayed for more than a few hundred
> > microseconds is if there are a continual stream of write holders
> > starving pending reads.
>
> Yes. That seems to be the case.

And both your fio testing and the testing I've done locally seem to
indicate that this is the case - a constant stream of writers will
block readers for very long periods of time.

> > I don't recall it being this bad historically but rwsems have been
> > severely performance optimised for various corner cases over the
> > past few years and it is extremely likely that behaviour has
> > changed. Definitely needs looking at, but rwsems are so fragile
> > these days (highly dependent on memory ordering that nobody seems to
> > be able to get right) and impossible to validate as correct that we
> > might just end up having to live with it.
> >
> > FWIW, what kernel did you test this on?
>
> The results above are from kernel 4.20.7-1.el7.elrepo.x86_64.
> Your assumption about change of behavior of rwsem over the
> years seems to be correct.
> The original report we got from a customer was with kernel 4.9.134.
> In the original report, long latencies were on writes rather than on reads.
> The following output of the filebench randomrw workload summarizes the results:

To summarise: "behaviour used to be constant stream of readers block
writers".

Yay us!

> rand-write1     1128ops       19ops/s   0.1mb/s    340.3ms/op [0.03ms - 7933.27ms]
> rand-read1   2954684ops    49238ops/s 384.7mb/s      0.2ms/op [0.02ms - 5873.27ms]
> 61.026: IO Summary: 2955812 ops 49257.253 ops/s 49238/19 rd/wr 384.8mb/s 0.3ms/op
>
> With a more recent kernel (4.20), the latencies have shifted to the read ops:
>
> rand-write1   467706ops     7794ops/s  60.9mb/s    0.733ms/op [0.004ms - 3899.167ms]
> rand-read1      1970ops       33ops/s   0.3mb/s  164.277ms/op [0.017ms - 3905.929ms]
> 61.032: IO Summary: 469676 ops 7826.760 ops/s 33/7794 rd/wr 61.1mb/s 1.419ms/op
>
> In a workload of 100% reads or 100% writes, there are no extreme
> latencies on neither old nor new kernel.

Yup, because there's no reader/writer contention on the lock. The
change of behaviour here is due to the rwsem implementation lacking
proper barrier semantics, and so the implementation changes
behaviour according to what microbenchmark goes faster this week.

We never had these problems with Irix rwsems, because:

commit a7724543560e9b6434710abcd5528234f108e204
Author: Doug Doucette <doucette@engr.sgi.com>
Date:   Sat Jun 14 01:44:30 1997 +0000

    Make the iolock be a BARRIER mrlock instead of the default.
    This makes it force new readers to wait if there's a writer
    waiting, insuring some fairness.

Almost all XFS rwsems were configured as MRLOCK_BARRIER rwsems,
which had mostly fair queuing behaviour:

- if held by read, other readers can grab until pending writer.
- if held by read and pending writer, reads queue behind writer.
- if held by read, no pending reads and pending writer, writers
  get batched with pending writer.
- if held by writer and no pending reader, new writer is queued
  to run after writer and before any new pending readers.
- if held by writer and pending reader, new writers get queued
  behind reads.
- if held by writer and pending reader, readers get batched with
  pending reader.

Basically, these barrier semantics prevented new readers from
starving waiting writers, and prevented new writers from starving
pending readers.

/me wonders if he should just implement this as a new type of lock
so that we don't end up having read/write balance change whenever
somebody modifies rwsem behaviour to make some microbenchmark go
faster....

> > I'd be interested in seeing if a fio workload using 16 randrw
> > threads demonstrates the same latency profile (i.e. threads aren't
> > exclusively read- or write-only), or is the profile coming about
> > specifically because you have dedicated write-only threads that will
> > continually bash on the IO locks and hence starve readers?
>
> I guess it depends on the rw mix. With a 50% r/w mix run on kernel
> 4.20.7-1.el7, long latencies are mostly on the write ops.

Worst case latencies are roughly equal at ~500ms, distribution is
almost identical - ~50% of writes are not blocking at all, the rest
are waiting behind reads and so showing >=50th percentile latencies
largely similar to the read side.

fio displays this behaviour because it bounds the run length of the
pending write or read IO stream as each thread selects read or write
randomly. Hence each thread will soon issue an IO that will get
blocked, and so the exclusion stream is essentially bound in depth by
the random number generator.

> > > My question is, is the purpose of this lock syncing
> > > dio/buffered io?
> >
> > That's one part of it. The other is POSIX atomic write semantics.
> >
> > https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html
> >
> > "I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > Atomic means that all the bytes from a single operation that started
> > out together end up together, without interleaving from other I/O
> > operations."
>
> I see... Are you aware of any applications that depend on this POSIX
> atomic write behavior? It would seem very unlikely that such applications
> exist, if only xfs provides this guarantee on Linux...

Databases like this behaviour very much....

> IMO, it would be best if applications that desire atomic write semantics
> would opt for it using a preadv2 flag. Or, for backward compat, we can allow
> applications to opt out of atomic write semantics. [CC fsdevel]

You seem to have already decided that it is going to be optional
behaviour in XFS. Perhaps you should wait for people to decide
whether we want to do that or not....

> All right. I am reading the above as no prior objection to adding
> an XFS mount option and/or preadv2() flag to opt out of this
> behavior and align with the reckless behavior of the other local
> filesystems on Linux.
> Please correct me if I am wrong and I would like to hear what
> other people think w.r.t. mount option vs. preadv2() flag.
> It could also be an open flag, but I prefer not to go down that road...

"No prior objection" doesn't mean "yes, we are going to do this".

Indeed, if you are doing high performance, concurrent mixed
read/write IO to the same file, then we've already got a solution
for this problem. i.e. using AIO+DIO on XFS will allow both reads
and writes to be issued concurrently and not block the main thread
of execution. That solves all the problems in one go...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [QUESTION] Long read latencies on mixed rw buffered IO
  2019-03-25 23:43 ` Dave Chinner
@ 2019-03-26  4:36   ` Amir Goldstein
  0 siblings, 0 replies; 16+ messages in thread
From: Amir Goldstein @ 2019-03-26  4:36 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, Christoph Hellwig, linux-fsdevel

On Tue, Mar 26, 2019 at 1:43 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Mar 25, 2019 at 09:49:33AM +0200, Amir Goldstein wrote:
> > On Mon, Mar 25, 2019 at 2:10 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Sun, Mar 24, 2019 at 08:18:10PM +0200, Amir Goldstein wrote:
> > > > Hi All,
> > > >
> > > > Christoph's re-factoring to xfs_ilock() brought up this question,
> > > > but AFAICS, current behavior seems to have always been that
> > > > way for xfs (?).
> > > >
> > > > Since commit 6552321831dc ("xfs: remove i_iolock and use
> > > > i_rwsem in the VFS inode instead"), xfs_file_buffered_aio_read()
> > > > is the only call sites I know of to call generic_file_read_iter() with
> > > > i_rwsem read side held.
> > >
> > > That did not change the locking behaviour of XFS at all.
> >
> > I know. Said so above.
>
> Followed by a (?), which seems to indicate you were unsure and
> seeking clarification.

Ah yes, the (?) was referring to "always been", which you already
affirmed.

> > > I'm guessing what we are seeing here is a rwsem starvation problem.
> > > The only reason a read would get delayed for more than a few hundred
> > > microseconds is if there are a continual stream of write holders
> > > starving pending reads.
> >
> > Yes. That seems to be the case.
>
> And both your fio testing and the testing I've done locally seem to
> indicate that this is the case - a constant stream of writers will
> block readers for very long periods of time.
>
> > > I don't recall it being this bad historically but rwsems have been
> > > severely performance optimised for various corner cases over the
> > > past few years and it is extremely likely that behaviour has
> > > changed. Definitely needs looking at, but rwsems are so fragile
> > > these days (highly dependent on memory ordering that nobody seems to
> > > be able to get right) and impossible to validate as correct that we
> > > might just end up having to live with it.
> > >
> > > FWIW, what kernel did you test this on?
> >
> > The results above are from kernel 4.20.7-1.el7.elrepo.x86_64.
> > Your assumption about change of behavior of rwsem over the
> > years seems to be correct.
> > The original report we got from a customer was with kernel 4.9.134.
> > In the original report, long latencies were on writes rather than on reads.
> > The following output of the filebench randomrw workload summarizes the results:
>
> To summarise: "behaviour used to be constant stream of readers block
> writers".
>
> Yay us!
>
> > rand-write1     1128ops       19ops/s   0.1mb/s    340.3ms/op [0.03ms - 7933.27ms]
> > rand-read1   2954684ops    49238ops/s 384.7mb/s      0.2ms/op [0.02ms - 5873.27ms]
> > 61.026: IO Summary: 2955812 ops 49257.253 ops/s 49238/19 rd/wr 384.8mb/s 0.3ms/op
> >
> > With a more recent kernel (4.20), the latencies have shifted to the read ops:
> >
> > rand-write1   467706ops     7794ops/s  60.9mb/s    0.733ms/op [0.004ms - 3899.167ms]
> > rand-read1      1970ops       33ops/s   0.3mb/s  164.277ms/op [0.017ms - 3905.929ms]
> > 61.032: IO Summary: 469676 ops 7826.760 ops/s 33/7794 rd/wr 61.1mb/s 1.419ms/op
> >
> > In a workload of 100% reads or 100% writes, there are no extreme
> > latencies on neither old nor new kernel.
>
> Yup, because there's no reader/writer contention on the lock. The
> change of behaviour here is due to the rwsem implementation lacking
> proper barrier semantics, and so the implementation changes
> behaviour according to what microbenchmark goes faster this week.
>
> We never had these problems with Irix rwsems, because:
>
> commit a7724543560e9b6434710abcd5528234f108e204
> Author: Doug Doucette <doucette@engr.sgi.com>
> Date:   Sat Jun 14 01:44:30 1997 +0000
>
>     Make the iolock be a BARRIER mrlock instead of the default.
>     This makes it force new readers to wait if there's a writer
>     waiting, insuring some fairness.
>
> Almost all XFS rwsems were configured as MRLOCK_BARRIER rwsems,
> which had mostly fair queuing behaviour:
>
> - if held by read, other readers can grab until pending writer.
> - if held by read and pending writer, reads queue behind writer.
> - if held by read, no pending reads and pending writer, writers
>   get batched with pending writer.
> - if held by writer and no pending reader, new writer is queued
>   to run after writer and before any new pending readers.
> - if held by writer and pending reader, new writers get queued
>   behind reads.
> - if held by writer and pending reader, readers get batched with
>   pending reader.
>
> Basically, these barrier semantics prevented new readers from
> starving waiting writers, and prevented new writers from starving
> pending readers.
>
> /me wonders if he should just implement this as a new type of lock
> so that we don't end up having read/write balance change whenever
> somebody modifies rwsem behaviour to make some microbenchmark go
> faster....

That sounds interesting. I will be happy to test patches.
Improving the microbenchmark is not my goal, though.
The customer reported write latencies with kernel 4.9.
I was able to reproduce the write latencies using a microbenchmark
on kernel 4.9, but the same microbenchmark demonstrated different
behavior (read latencies) on kernel 4.20.
So it looks like I am really after the case of readers blocking
writers, regardless of starvation, which has also been demonstrated
with fio in the workload flavor that you suggested (mixed r/w on the
same thread).

> > > I'd be interested in seeing if a fio workload using 16 randrw
> > > threads demonstrates the same latency profile (i.e. threads aren't
> > > exclusively read- or write-only), or is the profile coming about
> > > specifically because you have dedicated write-only threads that will
> > > continually bash on the IO locks and hence starve readers?
> >
> > I guess it depends on the rw mix. With a 50% r/w mix run on kernel
> > 4.20.7-1.el7, long latencies are mostly on the write ops.
>
> Worst case latencies are roughly equal at ~500ms, distribution is
> almost identical - ~50% of writes are not blocking at all, the rest
> are waiting behind reads and so showing >=50th percentile latencies
> largely similar to the read side.
>
> fio displays this behaviour because it bounds the run length of the
> pending write or read IO stream as each thread selects read or write
> randomly. Hence each thread will soon issue an IO that will get
> blocked, and so the exclusion stream is essentially bound in depth by
> the random number generator.
>
> > > > My question is, is the purpose of this lock syncing
> > > > dio/buffered io?
> > >
> > > That's one part of it. The other is POSIX atomic write semantics.
> > >
> > > https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html
> > >
> > > "I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > > Atomic means that all the bytes from a single operation that started
> > > out together end up together, without interleaving from other I/O
> > > operations."
> >
> > I see... Are you aware of any applications that depend on this POSIX
> > atomic write behavior? It would seem very unlikely that such applications
> > exist, if only xfs provides this guarantee on Linux...
>
> Databases like this behaviour very much....

I was under the impression that databases use DIO, where xfs
explicitly leaves concurrent IO synchronization to user space.
Is that not the case?

> > IMO, it would be best if applications that desire atomic write semantics
> > would opt for it using a preadv2 flag. Or, for backward compat, we can allow
> > applications to opt out of atomic write semantics. [CC fsdevel]
>
> You seem to have already decided that it is going to be optional
> behaviour in XFS. Perhaps you should wait for people to decide
> whether we want to do that or not....
>
> > All right. I am reading the above as no prior objection to adding
> > an XFS mount option and/or preadv2() flag to opt out of this
> > behavior and align with the reckless behavior of the other local
> > filesystems on Linux.
> > Please correct me if I am wrong and I would like to hear what
> > other people think w.r.t. mount option vs. preadv2() flag.
> > It could also be an open flag, but I prefer not to go down that road...
>
> "No prior objection" doesn't mean "yes, we are going to do this".

Of course. The burden of proof is still on me, to show a real-life
workload and to explain why the problem cannot be solved using APIs
that xfs already provides.
I am working via several layers of mediators to get to the real-life
workload, so the investigation may take some time.
At this point, I have gathered enough inputs to be able to ask this
[QUESTION]. If this discussion ends up also improving fairness of
rw_semaphore, all the better.

> Indeed, if you are doing high performance, concurrent mixed
> read/write IO to the same file, then we've already got a solution
> for this problem. i.e. using AIO+DIO on XFS will allow
> both reads and writes to be issued concurrently and not block the
> main thread of execution. That solves all the problems in one go...

Yes, I know. I have looked into using AIO+DIO in samba.
The problem is that samba is a general-purpose remote IO dispatcher,
so unlike a regular application, it cannot know a priori that IOs to
this file are going to be aligned (even though in this benchmark they
are).
I was under the impression that mixing concurrent buffered IO and DIO
on the same file is not recommended, but from this thread I understand
that it may be an option for xfs (?).
In any case, I'd rather not resort to special-casing xfs in samba,
although it may come to this...
I considered doing "opportunistic DIO", that is, starting with
O_DIRECT (always, or given certain SMB open hints) and switching to
buffered IO on the first non-aligned IO (after all previous DIO have
completed). That may work and should be fine with non-xfs as well.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads: [~2019-03-27  1:29 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com>
     [not found] ` <20190325001044.GA23020@dastard>
2019-03-25  7:49   ` [QUESTION] Long read latencies on mixed rw buffered IO Amir Goldstein
2019-03-25 15:47     ` Darrick J. Wong
2019-03-25 16:41       ` Matthew Wilcox
2019-03-25 17:30         ` Amir Goldstein
2019-03-25 18:22           ` Matthew Wilcox
2019-03-25 19:18             ` Amir Goldstein
2019-03-25 19:40               ` Matthew Wilcox
2019-03-25 19:57                 ` Amir Goldstein
2019-03-25 23:48                   ` Dave Chinner
2019-03-26  3:44                     ` Amir Goldstein
2019-03-27  1:29                       ` Dave Chinner
2019-03-25 17:56         ` Amir Goldstein
2019-03-25 18:02           ` Christoph Hellwig
2019-03-25 18:44             ` Amir Goldstein
2019-03-25 23:43   ` Dave Chinner
2019-03-26  4:36     ` Amir Goldstein