From: Milosz Tanski <milosz@adfin.com>
To: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>, Jens Axboe <axboe@kernel.dk>,
	Goldwyn Rodrigues <rgoldwyn@suse.com>,
	Mel Gorman <mgorman@suse.de>,
	Volker Lendecke <Volker.Lendecke@sernet.de>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	linux-block@vger.kernel.org
Subject: Re: non-blocking buffered reads
Date: Thu, 29 Jun 2017 21:12:19 -0400
Message-ID: <CANP1eJG6OUtZ_Vcu0A53SZYKfwSY8n5nVfvXnokPMXzpbdLQ1A@mail.gmail.com>
In-Reply-To: <20170629212503.15110-1-hch@lst.de>

On Thu, Jun 29, 2017 at 5:25 PM, Christoph Hellwig <hch@lst.de> wrote:
>
> This series resurrects the old patches from Milosz to implement
> non-blocking buffered reads.  Thanks to the non-blocking AIO code from
> Goldwyn the implementation becomes pretty much trivial.  As that
> implementation is in the block tree I would suggest that we merge
> these patches through the block tree as well.  I've also forward
> ported the test Milosz sent for recent xfsprogs to verify it works
> properly, but I'll still have to address the review comments for it.
> I'll also volunteer to work with Goldwyn to properly document the
> RWF_NOWAIT flag in the man page including this change.

I had updated patches for the man pages, so I'll check tomorrow if I
can dig up the changes and forward them on.

>
> Here are additional details from the original cover letter from Milosz,
> where the flag was still called RWF_NONBLOCK:
>
>
> Background:
>
>  Using a threadpool to emulate non-blocking operations on regular buffered
>  files is a common pattern today (samba, libuv, etc.). Applications split the
>  work between network-bound threads (epoll) and an IO threadpool. Not every
>  application can use the sendfile syscall (TLS / post-processing).
>
>  This common pattern leads to increased request latency. Latency can come
>  from additional synchronization between the threads, or from a fast request
>  (cached data) stuck behind a slow one (large / uncached data).
>
>  The preadv2 syscall with RWF_NONBLOCK lets userspace applications bypass
>  enqueuing an operation in the threadpool when the data is already available
>  in the page cache.
>
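
For anyone who hasn't seen the pattern, the userspace side ends up
looking roughly like the sketch below. This is a minimal sketch using
the final RWF_NOWAIT name; queue_read_to_threadpool() is a hypothetical
application helper, not a real API, and the glibc preadv2() wrapper
needs _GNU_SOURCE (glibc >= 2.26, kernel >= 4.6 for preadv2 itself):

#define _GNU_SOURCE		/* for preadv2() and RWF_NOWAIT */
#include <sys/types.h>
#include <sys/uio.h>
#include <errno.h>

/* Application-specific fallback path; hypothetical helper, not a real API. */
ssize_t queue_read_to_threadpool(int fd, void *buf, size_t len, off_t off);

/*
 * Try to satisfy the read straight from the page cache.  If the data is
 * not resident, the kernel returns -1/EAGAIN instead of blocking, and
 * only then do we pay for the threadpool round trip.
 */
ssize_t read_fast_or_queue(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret;

	ret = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
	if (ret >= 0)
		return ret;		/* served from the page cache */
	if (errno != EAGAIN)
		return -1;		/* genuine I/O error */
	return queue_read_to_threadpool(fd, buf, len, off);
}

The point is that the preadv2() call never sleeps on storage: a cache
miss costs one cheap syscall before the request takes the normal
threadpool path.
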
>
> Performance numbers (newer Samba):
>
>  https://drive.google.com/file/d/0B3maCn0jCvYncndGbXJKbGlhejQ/view?usp=sharing
>  https://docs.google.com/spreadsheets/d/1GGTivi-MfZU0doMzomG4XUo9ioWtRvOGQ5FId042L6s/edit?usp=sharing
>
>
> Performance numbers (older):
>
>  Some perf data generated using fio, comparing the stock posix AIO engine to
>  a version of the posix AIO engine that attempts to perform "fast" reads
>  before submitting the operations to the queue. This workload runs on an ext4
>  partition on raid0 (test / build rig), simulating our database access
>  pattern using 16kb read accesses. Our database uses a home-spun posix
>  AIO-like queue (samba does the same thing.)
>
>  f1: ~73% rand read over mostly cached data (zipf med-size dataset)
>  f2: ~18% rand read over mostly uncached data (uniform large dataset)
>  f3: ~9% seq read over large dataset
>
>  before:
>
>  f1:
>      bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
>      lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
>      lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
>  f2:
>      bw (KB  /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
>      lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
>      lat (msec) : >=2000=4.33%
>  f3:
>      bw (KB  /s): min=    0, max=265568, per=99.95%, avg=174575.10,
>                   stdev=34526.89
>      lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
>      lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
>      lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
>      lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
>  total:
>     READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
>           mint=600001msec, maxt=600113msec
>
>  after (with fast read using preadv2 before submit):
>
>  f1:
>      bw (KB  /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
>      lat (usec) : 2=70.63%, 4=0.01%
>      lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
>  f2:
>      bw (KB  /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
>      lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
>      lat (msec) : >=2000=9.99%
>  f3:
>      bw (KB  /s): min=    1, max=245448, per=100.00%, avg=177366.50,
>                   stdev=35995.60
>      lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
>      lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
>      lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
>      lat (msec) : 100=0.05%, 250=0.02%
>  total:
>     READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
>           mint=600020msec, maxt=600178msec
>
>  Interpreting the results, you can see total bandwidth stays the same but
>  overall request latency is decreased in the f1 (random, mostly cached) and
>  f3 (sequential) workloads. There is a slight bump in latency for f2 since
>  it's random data that's unlikely to be cached, but we're always attempting
>  the "fast read" first.
>
>  In our application we have started keeping track of "fast read" hits/misses,
>  and for files / requests that have a low hit ratio we don't attempt "fast
>  reads", mostly getting rid of the extra latency in the uncached cases. In
>  our real-world workload we were able to reduce average response time by 20
>  to 30% (depending on the amount of IO done by the request).
>
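
To make the adaptive part concrete, here's a rough sketch of the kind
of per-file bookkeeping involved; the struct, names, sample count and
~25% threshold are all made up for illustration, not from any real
codebase:

#include <stdbool.h>

/* Hypothetical per-file counters for the RWF_NOWAIT probe. */
struct fastread_stats {
	unsigned long hits;	/* preadv2(..., RWF_NOWAIT) succeeded */
	unsigned long misses;	/* returned EAGAIN, fell back to pool */
};

static bool should_try_fast_read(const struct fastread_stats *s)
{
	unsigned long total = s->hits + s->misses;

	if (total < 128)	/* too few samples, keep probing */
		return true;
	/* Skip the probe once the hit ratio drops below ~25% (assumed). */
	return s->hits * 4 >= total;
}

Files that turn out to be mostly uncached then take the threadpool path
directly and stop paying for the extra syscall.
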
>  I've performed other benchmarks and have not observed any perf regressions
>  in any of the normal (old) code paths.




-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

Thread overview: 9+ messages
2017-06-29 21:25 non-blocking buffered reads Christoph Hellwig
2017-06-29 21:25 ` [PATCH 1/3] fs: pass iocb to do_generic_file_read Christoph Hellwig
2017-06-29 21:25 ` [PATCH 2/3] fs: support IOCB_NOWAIT in generic_file_buffered_read Christoph Hellwig
2017-06-29 21:25 ` [PATCH 3/3] fs: support RWF_NOWAIT for buffered reads Christoph Hellwig
2017-06-30  3:43   ` Goldwyn Rodrigues
2017-06-29 23:46 ` non-blocking " reads Al Viro
2017-06-30  0:34   ` Christoph Hellwig
2017-06-30  1:11 ` Milosz Tanski
2017-06-30  1:12 ` Milosz Tanski [this message]
