On Thu, Jun 29, 2017 at 5:25 PM, Christoph Hellwig wrote:
> This series resurrects the old patches from Milosz to implement
> non-blocking buffered reads. Thanks to the non-blocking AIO code from
> Goldwyn the implementation becomes pretty much trivial. As that
> implementation is in the block tree I would suggest that we merge
> these patches through the block tree as well. I've also forward
> ported the test Milosz sent for recent xfsprogs to verify it works
> properly, but I'll still have to address the review comments for it.
> I'll also volunteer to work with Goldwyn to properly document the
> RWF_NOWAIT flag in the man page including this change.

I had updated patches for the man pages, so I'll check tomorrow if I
can dig up the changes and forward them on.

> Here are additional details from the original cover letter from
> Milosz, where the flag was still called RWF_NONBLOCK:
>
> Background:
>
> Using a threadpool to emulate non-blocking operations on regular
> buffered files is a common pattern today (samba, libuv, etc.).
> Applications split the work between network-bound threads (epoll) and
> an IO threadpool. Not every application can use the sendfile syscall
> (TLS / post-processing).
>
> This common pattern leads to increased request latency. The latency
> can come from additional synchronization between the threads, or from
> a fast (cached data) request stuck behind a slow (large / uncached
> data) request.
>
> The preadv2 syscall with RWF_NONBLOCK lets userspace applications
> bypass enqueuing the operation in the threadpool when the data is
> already available in the page cache.
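To make that fast-read pattern concrete, here is a minimal standalone
sketch (using the merged RWF_NOWAIT spelling). It is only an
illustration, not code from the patch set: a real application would
enqueue the miss to its IO threadpool instead of doing the blocking
pread() inline, and the offset 0 / 16kb buffer are arbitrary.

    /* fast_read.c - try a non-blocking page-cache read first, fall
     * back to blocking IO when the data is not resident. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef RWF_NOWAIT          /* kernel flag; define if headers are old */
    #define RWF_NOWAIT 0x00000008
    #endif

    int main(int argc, char **argv)
    {
        char buf[16 * 1024];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        int fd;
        ssize_t n;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        if ((fd = open(argv[1], O_RDONLY)) < 0) {
            perror("open");
            return 1;
        }

        n = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);
        if (n >= 0) {
            printf("fast read hit: %zd bytes from the page cache\n", n);
        } else if (errno == EAGAIN) {
            /* Miss: the read would block. A real server would hand
             * the request to its threadpool at this point. */
            n = pread(fd, buf, sizeof(buf), 0);
            printf("fast read miss: blocking read got %zd bytes\n", n);
        } else {
            perror("preadv2"); /* e.g. unsupported flag on old kernels */
        }
        close(fd);
        return 0;
    }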
> Performance numbers (newer Samba):
>
> https://drive.google.com/file/d/0B3maCn0jCvYncndGbXJKbGlhejQ/view?usp=sharing
> https://docs.google.com/spreadsheets/d/1GGTivi-MfZU0doMzomG4XUo9ioWtRvOGQ5FId042L6s/edit?usp=sharing
>
> Performance numbers (older):
>
> Some perf data generated using fio, comparing the posix aio engine to
> a version of the posix aio engine that attempts to perform "fast"
> reads before submitting the operations to the queue. This workload
> runs on an ext4 partition on raid0 (test / build-rig), simulating our
> database access-pattern workload with 16kb read accesses. Our
> database uses a home-spun posix-aio-like queue (samba does the same
> thing).
>
> f1: ~73% rand read over mostly cached data (zipf med-size dataset)
> f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
> f3: ~9% seq-read over large dataset
>
> before:
>
> f1:
>   bw (KB /s): min= 11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
>   lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
>   lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
> f2:
>   bw (KB /s): min= 2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
>   lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
>   lat (msec) : >=2000=4.33%
> f3:
>   bw (KB /s): min= 0, max=265568, per=99.95%, avg=174575.10, stdev=34526.89
>   lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
>   lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
>   lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
>   lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
> total:
>   READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
>         mint=600001msec, maxt=600113msec
>
> after (with fast read using preadv2 before submit):
>
> f1:
>   bw (KB /s): min= 3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
>   lat (usec) : 2=70.63%, 4=0.01%
>   lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
> f2:
>   bw (KB /s): min= 2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
>   lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
>   lat (msec) : >=2000=9.99%
> f3:
>   bw (KB /s): min= 1, max=245448, per=100.00%, avg=177366.50, stdev=35995.60
>   lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
>   lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
>   lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
>   lat (msec) : 100=0.05%, 250=0.02%
> total:
>   READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
>         mint=600020msec, maxt=600178msec
>
> Interpreting the results, you can see that total bandwidth stays the
> same but overall request latency decreases in the f1 (random, mostly
> cached) and f3 (sequential) workloads. There is a slight bump in
> latency for f2 since it's random data that's unlikely to be cached,
> yet we always try the "fast read" first.
>
> In our application we have started keeping track of "fast read"
> hits/misses, and for files / requests with a low hit ratio we skip
> the "fast reads" entirely, mostly getting rid of the extra latency in
> the uncached cases. In our real-world workload we were able to reduce
> average response time by 20 to 30% (depending on the amount of IO
> done by the request).
>
> I've performed other benchmarks and I have not observed any perf
> regressions in any of the normal (old) code paths.

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com
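P.S. The per-file hit/miss gate mentioned above can be as simple as
the sketch below. It is purely illustrative: the struct, the
128-request warm-up and the ~30% threshold are made up for this
example, not taken from our application.

    /* Per-file "fast read" gate: stop paying for a failed
     * preadv2(RWF_NOWAIT) attempt once a file has shown that it is
     * mostly uncached. */
    struct fast_read_stats {
        unsigned long hits;    /* non-blocking read succeeded */
        unsigned long misses;  /* EAGAIN: request went to the pool */
    };

    static int should_try_fast_read(const struct fast_read_stats *st)
    {
        unsigned long total = st->hits + st->misses;

        if (total < 128)       /* warm-up: always try at first */
            return 1;
        /* Keep trying while ~30% or more of the attempts hit the
         * page cache; otherwise go straight to the threadpool. */
        return st->hits * 100 >= total * 30;
    }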