From: Milosz Tanski <milosz@adfin.com>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	linux-aio@kvack.org, Mel Gorman <mgorman@suse.de>,
	Volker Lendecke <Volker.Lendecke@sernet.de>,
	Tejun Heo <tj@kernel.org>, Jeff Moyer <jmoyer@redhat.com>
Subject: Re: [RFC PATCH 0/7] Non-blocking buffered fs read (page cache only)
Date: Mon, 15 Sep 2014 16:27:24 -0400	[thread overview]
Message-ID: <CANP1eJFsL7uuYqd_LZpkJ5bqSQ44JSWj63tbAbk_MGkBknpoKw@mail.gmail.com> (raw)
In-Reply-To: <cover.1410810247.git.milosz@adfin.com>

As promised, here is some performance data. I ended up copying the posix
AIO engine and hacking it up to support the preadv2 syscall so it can
perform a "fast read" in the submit thread. Below are my observations,
followed by test data on a local filesystem (ext4) for two different test
cases (the second one being the more realistic case). I also tried this
with a remote filesystem (Ceph), where I was able to get a much bigger
latency improvement.
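
For context, the fast read in the submit thread amounts to something like
the sketch below. This is not the actual engine patch: do_preadv2() stands
in for a raw-syscall wrapper (there is no libc wrapper yet) and
queue_to_io_threadpool() stands in for the engine's normal slow path.

#include <errno.h>
#include <fcntl.h>      /* O_NONBLOCK */
#include <sys/types.h>
#include <sys/uio.h>    /* struct iovec */

/* Stand-in for a wrapper around the new preadv2 syscall. */
extern ssize_t do_preadv2(int fd, const struct iovec *iov, int iovcnt,
                          off_t off, int flags);
/* Stand-in for handing the request to the existing IO worker pool. */
extern void queue_to_io_threadpool(int fd, const struct iovec *iov,
                                   int iovcnt, off_t off);

/*
 * Attempt a non-blocking read from the submit thread.  If the data is in
 * the page cache the read completes right here; on -1/EAGAIN it falls
 * back to the worker pool as before (a short read would need the same
 * treatment for the remainder).
 */
ssize_t submit_read(int fd, const struct iovec *iov, int iovcnt, off_t off)
{
        ssize_t ret = do_preadv2(fd, iov, iovcnt, off, O_NONBLOCK);

        if (ret >= 0)
                return ret;                     /* served from page cache */
        if (errno != EAGAIN)
                return ret;                     /* real error */
        queue_to_io_threadpool(fd, iov, iovcnt, off);
        return -1;                              /* will complete on a worker */
}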

- I tested two workloads. One is a primarily cached workload; the other
simulates a more complex workload that tries to mimic what we see on our
DB nodes.
- In the mostly cached case the bandwidth doesn't increase, but the
request latency is much better. Here the bottleneck on total bandwidth
is probably the single submission thread.
- In the second case we generally see the same thing. Bandwidth is more
or less the same; request latency is much better for random reads of
cached data and for sequential reads (thanks to the kernel's readahead
detection). Request latency for random uncached data is worse (since we
issue two syscalls).
- Posix AIO probably suffers due to synchronization; it could be improved
with a lockless MPMC queue and an aggressive spin before the sleeping
wait.
- I can probably bring the uncached latency to within the margin of error
if I add miss detection to the submission code (don't attempt the fast
read for a while if a high percentage of them fail); rough sketch below.
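
A rough sketch of that miss detection (names and thresholds invented for
illustration; the counters assume a single submission thread, as in the
posix AIO engine):

#define FASTREAD_WINDOW    256   /* fast-read attempts per sampling window */
#define FASTREAD_MIN_HITS   64   /* fewer cache hits than this => back off */
#define FASTREAD_BACKOFF  4096   /* reads to skip before trying fast reads again */

unsigned int fr_attempts, fr_hits, fr_backoff;

/* Called before each read: should we even try the extra preadv2 syscall? */
int fastread_should_try(void)
{
        if (fr_backoff) {
                fr_backoff--;
                return 0;       /* recent hit rate was poor, skip it */
        }
        return 1;
}

/* Called after each fast-read attempt, hit=1 if it was served from cache. */
void fastread_record(int hit)
{
        fr_attempts++;
        if (hit)
                fr_hits++;
        if (fr_attempts >= FASTREAD_WINDOW) {
                if (fr_hits < FASTREAD_MIN_HITS)
                        fr_backoff = FASTREAD_BACKOFF;
                fr_attempts = fr_hits = 0;
        }
}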

There is a lot of room for improvement, but even in its crude state it
already helps apps with a similar design (a threaded IO worker pool).

Simple in-memory workload (mostly cached), 16KB blocks:

posix_aio:

bw (KB  /s): min=    5, max=29125, per=100.00%, avg=17662.31, stdev=4735.36
lat (usec) : 100=0.17%, 250=0.02%, 500=0.02%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.08%, 10=0.54%, 20=2.97%, 50=40.26%
lat (msec) : 100=49.41%, 250=6.31%, 500=0.21%
READ: io=5171.4MB, aggrb=17649KB/s, minb=17649KB/s, maxb=17649KB/s,
mint=300030msec, maxt=300030msec

posix_aio w/ fast_read:

bw (KB  /s): min=   15, max=38624, per=100.00%, avg=17977.23, stdev=6043.56
lat (usec) : 2=84.33%, 4=0.01%, 10=0.01%, 20=0.01%
lat (msec) : 50=0.01%, 100=0.01%, 250=0.48%, 500=14.45%, 750=0.67%
lat (msec) : 1000=0.05%
READ: io=5235.4MB, aggrb=17849KB/s, minb=17849KB/s, maxb=17849KB/s,
mint=300341msec, maxt=300341msec

Complex workload (simulating our DB access pattern), 16KB blocks:

f1: ~73% rand read over mostly cached data (zipf med-size dataset)
f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
f3: ~9% seq-read over large dataset

posix_aio:

f1:
    bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
    lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
    lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
f2:
    bw (KB  /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
    lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
    lat (msec) : >=2000=4.33%
f3:
    bw (KB  /s): min=    0, max=265568, per=99.95%, avg=174575.10,
stdev=34526.89
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
    lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
    lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
total:
   READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
mint=600001msec, maxt=600113msec

posix_aio w/ fast_read:

f1:
    bw (KB  /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
    lat (usec) : 2=70.63%, 4=0.01%
    lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
f2:
    bw (KB  /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
    lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
    lat (msec) : >=2000=9.99%
f3:
    bw (KB  /s): min=    1, max=245448, per=100.00%, avg=177366.50,
stdev=35995.60
    lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
    lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
    lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%
total:
   READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
mint=600020msec, maxt=600178msec

On Mon, Sep 15, 2014 at 4:20 PM, Milosz Tanski <milosz@adfin.com> wrote:
> This patchset introduces the ability to perform a non-blocking read from
> regular files in buffered IO mode. It only returns data that is already
> present in the page cache.
>
> It does this by introducing new syscalls: readv2/writev2 and
> preadv2/pwritev2. These new syscalls behave like the network sendmsg and
> recvmsg syscalls in that they accept an extra flags argument (O_NONBLOCK).
>
> It's a very common pattern today (samba, libuv, etc.) to use a large
> threadpool to perform buffered IO operations. The work is submitted from
> another thread that performs network IO and epoll, or from other threads
> that perform CPU work. This leads to increased processing latency,
> especially for data that is already cached in the page cache.
>
> With the new interface applications will now be able to fetch the data in
> their network / CPU bound thread(s) and only defer to a threadpool if it's
> not there. In our own application (VLDB) we've observed a decrease in
> latency for "fast" requests by avoiding unnecessary queuing and the need
> to swap out current tasks in IO-bound worker threads.
>
> I co-developed these changes with Christoph Hellwig; a whole lot of his
> fixes went into the first patch in the series (squashed with his
> approval).
>
> I am going to post the perf report in a reply-to to this RFC.
>
> Christoph Hellwig (3):
>   documentation updates
>   move flags enforcement to vfs_preadv/vfs_pwritev
>   check for O_NONBLOCK in all read_iter instances
>
> Milosz Tanski (4):
>   Prepare for adding a new readv/writev with user flags.
>   Define new syscalls readv2,preadv2,writev2,pwritev2
>   Export new vector IO (with flags) to userland
>   O_NONBLOCK flag for readv2/preadv2
>
>  Documentation/filesystems/Locking |    4 +-
>  Documentation/filesystems/vfs.txt |    4 +-
>  arch/x86/syscalls/syscall_32.tbl  |    4 +
>  arch/x86/syscalls/syscall_64.tbl  |    4 +
>  drivers/target/target_core_file.c |    6 +-
>  fs/afs/internal.h                 |    2 +-
>  fs/afs/write.c                    |    4 +-
>  fs/aio.c                          |    4 +-
>  fs/block_dev.c                    |    9 ++-
>  fs/btrfs/file.c                   |    2 +-
>  fs/ceph/file.c                    |   10 ++-
>  fs/cifs/cifsfs.c                  |    9 ++-
>  fs/cifs/cifsfs.h                  |   12 ++-
>  fs/cifs/file.c                    |   30 +++++---
>  fs/ecryptfs/file.c                |    4 +-
>  fs/ext4/file.c                    |    4 +-
>  fs/fuse/file.c                    |   10 ++-
>  fs/gfs2/file.c                    |    5 +-
>  fs/nfs/file.c                     |   13 ++--
>  fs/nfs/internal.h                 |    4 +-
>  fs/nfsd/vfs.c                     |    4 +-
>  fs/ocfs2/file.c                   |   13 +++-
>  fs/pipe.c                         |    7 +-
>  fs/read_write.c                   |  146 +++++++++++++++++++++++++++++++------
>  fs/splice.c                       |    4 +-
>  fs/ubifs/file.c                   |    5 +-
>  fs/udf/file.c                     |    5 +-
>  fs/xfs/xfs_file.c                 |   12 ++-
>  include/linux/fs.h                |   16 ++--
>  include/linux/syscalls.h          |   12 +++
>  include/uapi/asm-generic/unistd.h |   10 ++-
>  mm/filemap.c                      |   34 +++++++--
>  mm/shmem.c                        |    6 +-
>  33 files changed, 306 insertions(+), 112 deletions(-)
>
> --
> 1.7.9.5
>



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

Thread overview: 86+ messages
2014-09-15 20:20 [RFC PATCH 0/7] Non-blocking buffered fs read (page cache only) Milosz Tanski
2014-09-15 20:20 ` [PATCH 1/7] Prepare for adding a new readv/writev with user flags Milosz Tanski
2014-09-15 20:28   ` Al Viro
2014-09-15 21:15     ` Christoph Hellwig
2014-09-15 21:44       ` Milosz Tanski
2014-09-15 20:20 ` [PATCH 2/7] Define new syscalls readv2,preadv2,writev2,pwritev2 Milosz Tanski
2014-09-16 19:20   ` Jeff Moyer
2014-09-16 19:54     ` Milosz Tanski
2014-09-16 21:03     ` Christoph Hellwig
2014-09-17 15:43   ` Theodore Ts'o
2014-09-17 16:05     ` Milosz Tanski
2014-09-17 16:59       ` Theodore Ts'o
2014-09-17 17:24         ` Zach Brown
2014-09-15 20:20 ` [PATCH 3/7] Export new vector IO (with flags) to userland Milosz Tanski
2014-09-15 20:21 ` [PATCH 4/7] O_NONBLOCK flag for readv2/preadv2 Milosz Tanski
2014-09-16 19:19   ` Jeff Moyer
2014-09-16 19:44     ` Milosz Tanski
2014-09-16 19:53       ` Jeff Moyer
2014-09-15 20:21 ` [PATCH 5/7] documentation updates Christoph Hellwig
2014-09-15 20:21 ` [PATCH 6/7] move flags enforcement to vfs_preadv/vfs_pwritev Christoph Hellwig
2014-09-15 21:15   ` Christoph Hellwig
2014-09-15 21:45     ` Milosz Tanski
2014-09-15 20:22 ` [PATCH 7/7] check for O_NONBLOCK in all read_iter instances Christoph Hellwig
2014-09-16 19:27   ` Jeff Moyer
2014-09-16 19:45     ` Milosz Tanski
2014-09-16 21:42       ` Dave Chinner
2014-09-17 12:24         ` Benjamin LaHaise
2014-09-17 13:47           ` Theodore Ts'o
2014-09-17 13:56             ` Benjamin LaHaise
2014-09-17 15:33               ` Milosz Tanski
2014-09-17 15:49                 ` Theodore Ts'o
2014-09-17 15:52               ` Zach Brown
2014-09-16 21:04     ` Christoph Hellwig
2014-09-16 21:24       ` Jeff Moyer
2014-09-15 20:27 ` Milosz Tanski [this message]
2014-09-15 21:33 ` [RFC PATCH 0/7] Non-blocking buffered fs read (page cache only) Andreas Dilger
2014-09-15 22:13   ` Milosz Tanski
2014-09-15 22:36   ` Elliott, Robert (Server Storage)
2014-09-16 18:24     ` Zach Brown
2014-09-19 11:21     ` Christoph Hellwig
2014-09-22 15:48       ` Jeff Moyer
2014-09-22 16:32         ` Milosz Tanski
2014-09-22 16:42           ` Christoph Hellwig
2014-09-22 17:02             ` Milosz Tanski
2014-09-22 16:25       ` Elliott, Robert (Server Storage)
2014-09-15 21:58 ` Jeff Moyer
2014-09-15 22:27   ` Milosz Tanski
2014-09-16 13:44     ` Jeff Moyer
2014-09-19 11:23   ` Christoph Hellwig
2014-09-16 19:30 ` Jeff Moyer
2014-09-16 20:34   ` Milosz Tanski
2014-09-16 20:49     ` Jeff Moyer
2014-09-17 14:49 ` [RFC 1/2] aio: async readahead Benjamin LaHaise
2014-09-17 15:26   ` [RFC 2/2] ext4: async readpage for indirect style inodes Benjamin LaHaise
2014-09-19 11:26   ` [RFC 1/2] aio: async readahead Christoph Hellwig
2014-09-19 16:01     ` Benjamin LaHaise
2014-09-17 22:20 ` [RFC v2 0/5] Non-blocking buffered fs read (page cache only) Milosz Tanski
2014-09-17 22:20   ` [RFC v2 1/5] Prepare for adding a new readv/writev with user flags Milosz Tanski
2014-09-17 22:20   ` [RFC v2 2/5] Define new syscalls readv2,preadv2,writev2,pwritev2 Milosz Tanski
2014-09-18 18:48     ` Darrick J. Wong
2014-09-19 10:52       ` Christoph Hellwig
2014-09-20  0:19         ` Darrick J. Wong
2014-09-17 22:20   ` [RFC v2 3/5] Export new vector IO (with flags) to userland Milosz Tanski
2014-09-17 22:20   ` [RFC v2 4/5] O_NONBLOCK flag for readv2/preadv2 Milosz Tanski
2014-09-19 11:27     ` Christoph Hellwig
2014-09-19 11:59       ` Milosz Tanski
2014-09-22 17:12     ` Jeff Moyer
2014-09-17 22:20   ` [RFC v2 5/5] Check for O_NONBLOCK in all read_iter instances Milosz Tanski
2014-09-19 11:26     ` Christoph Hellwig
2014-09-19 14:42   ` [RFC v2 0/5] Non-blocking buffered fs read (page cache only) Jonathan Corbet
2014-09-19 16:13     ` Volker Lendecke
2014-09-19 17:19     ` Milosz Tanski
2014-09-19 17:33     ` Milosz Tanski
2014-09-22 14:12       ` Jonathan Corbet
2014-09-22 14:24         ` Jeff Moyer
2014-09-22 14:25         ` Christoph Hellwig
2014-09-22 14:30         ` Milosz Tanski
2014-09-24 21:46 ` [RFC v3 0/4] vfs: " Milosz Tanski
2014-09-24 21:46   ` [RFC v3 1/4] vfs: Prepare for adding a new preadv/pwritev with user flags Milosz Tanski
2014-09-24 21:46   ` [RFC v3 2/4] vfs: Define new syscalls preadv2,pwritev2 Milosz Tanski
2014-09-24 21:46   ` [RFC v3 3/4] vfs: Export new vector IO syscalls (with flags) to userland Milosz Tanski
2014-09-24 21:46   ` [RFC v3 4/4] vfs: RWF_NONBLOCK flag for preadv2 Milosz Tanski
2014-09-25  4:06   ` [RFC v3 0/4] vfs: Non-blocking buffered fs read (page cache only) Michael Kerrisk
2014-09-25 11:16     ` Jan Kara
2014-09-25 15:48     ` Milosz Tanski
2014-10-08  2:53   ` Milosz Tanski
