From: Pavel Butsykin <pbutsykin@virtuozzo.com>
To: Avi Kivity <avi@scylladb.com>,
	qemu-block@nongnu.org, qemu-devel@nongnu.org
Cc: kwolf@redhat.com, famz@redhat.com, mreitz@redhat.com,
	stefanha@redhat.com, den@openvz.org, jsnow@redhat.com
Subject: Re: [Qemu-devel] [PATCH RFC v2 00/22] I/O prefetch cache
Date: Tue, 6 Sep 2016 15:40:55 +0300
Message-ID: <57CEB957.5050009@virtuozzo.com>
In-Reply-To: <83595cde-6b37-20c2-a37d-e6b030a005a6@scylladb.com>

On 01.09.2016 18:26, Avi Kivity wrote:
> On 08/29/2016 08:09 PM, Pavel Butsykin wrote:
>> The prefetch cache aims to improve the performance of sequential data
>> reads. Of most interest here are small sequential read requests; such
>> requests can be optimized by extending them and moving the data into
>> the prefetch cache. However, there are 2 issues:
>>   - In aggregate, only a small portion of requests is sequential, so
>>     delays caused by the need to read larger volumes of data would
>>     lead to an overall decrease in performance.
>>   - With a large number of random requests, redundant data would be
>>     kept in the cache memory.
>> This pcache implementation solves the above and other problems of
>> prefetching data. The pcache algorithm can be summarised by the
>> following main steps.
>>
>> 1. Monitor I/O requests to identify typical sequences.
>> This prefetch cache implementation works at the storage system level
>> and has information only about the physical block addresses of I/O
>> requests. Statistics are collected only from read requests up to a
>> maximum size of 32KB (by default); each request that matches the
>> criteria falls into a pool of requests. The request statistics are
>> stored in an rb-tree (lreq.tree), a simple but, for this purpose,
>> quite efficient data structure.
>>
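The bookkeeping of this step can be sketched roughly as below. This is only an illustration: a plain sorted array stands in for the lreq.tree rb-tree, and all names and the `POOL_CAP` limit are assumptions of the sketch, not the driver's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCACHE_MAX_AIO_SIZE (32 * 1024)  /* default request-size cutoff */
#define POOL_CAP 64                      /* illustrative pool bound */

typedef struct ReqStat {
    uint64_t offset;   /* byte offset of the read */
    uint64_t bytes;    /* request length */
} ReqStat;

/* Stand-in for lreq.tree: a sorted array keyed by offset.  The real
 * driver keeps these entries in an rb-tree. */
typedef struct ReqPool {
    ReqStat reqs[POOL_CAP];
    size_t  count;
} ReqPool;

/* Record a read request in the statistics pool, skipping requests
 * larger than the cutoff.  Returns true if the request was recorded. */
static bool pool_track_read(ReqPool *pool, uint64_t offset, uint64_t bytes)
{
    if (bytes > PCACHE_MAX_AIO_SIZE || pool->count == POOL_CAP) {
        return false;
    }
    size_t i = pool->count;
    while (i > 0 && pool->reqs[i - 1].offset > offset) {
        pool->reqs[i] = pool->reqs[i - 1];   /* keep sorted by offset */
        i--;
    }
    pool->reqs[i] = (ReqStat){ .offset = offset, .bytes = bytes };
    pool->count++;
    return true;
}
```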
>> 2. Identify sequential I/O streams.
>> For each read request, an attempt is made to pick up from lreq.tree the
>> chain of sequential requests of which this request would be an element.
>> The key used to search for consecutive requests is the area of sectors
>> immediately preceding the current request. This area should not be too
>> small, to avoid false readahead. Sequential streams can be identified
>> even among a large number of random requests. For example, if there are
>> accesses to blocks 100, 1157, 27520, 4, 101, 312, 1337, 102, then while
>> processing request 102 the chain of sequential requests 100, 101, 102
>> will be identified, and at that point a readahead decision can be made.
>> A situation may also arise where multiple applications A, B, C perform
>> sequential reads simultaneously. Taken separately, each application
>> reads data sequentially: A(100, 101, 102), B(300, 301, 302),
>> C(700, 701, 702); but to the block device it may look like random
>> reading: 100, 300, 700, 101, 301, 701, 102, 302, 702. In this case the
>> sequential streams will still be recognised, because the placement of
>> the requests in the rb-tree makes it possible to separate the
>> sequential I/O streams.
>>
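The chain lookup described above can be sketched like this. Again a simplification: a linear scan over an array stands in for the rb-tree search, and `SECTOR`, `pool_add` and `chain_len` are illustrative names, not the driver's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR 512

typedef struct ReqStat { uint64_t offset, bytes; } ReqStat;

/* Stand-in for lreq.tree: the pool of recorded small reads.  The real
 * driver keeps them in an rb-tree keyed by offset. */
typedef struct ReqPool { ReqStat reqs[64]; size_t count; } ReqPool;

static void pool_add(ReqPool *p, uint64_t off, uint64_t bytes)
{
    p->reqs[p->count++] = (ReqStat){ off, bytes };
}

/* Walk backwards from the request starting at `offset`: each step looks
 * for a recorded request that ends exactly where the chain currently
 * begins.  Returns the chain length including the current request. */
static size_t chain_len(const ReqPool *p, uint64_t offset)
{
    size_t len = 1;
    uint64_t want_end = offset;          /* predecessor must end here */
    bool found = true;
    while (found) {
        found = false;
        for (size_t i = 0; i < p->count; i++) {
            if (p->reqs[i].offset + p->reqs[i].bytes == want_end) {
                want_end = p->reqs[i].offset;
                len++;
                found = true;
                break;
            }
        }
    }
    return len;
}
```

With the access pattern from the example above, processing block 102 finds the three-element chain 100, 101, 102, while the random requests do not extend any chain.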
>> 3. Do readahead into the cache for recognised sequential data streams.
>> Once the detection problem is solved, pcache needs larger requests to
>> bring data into the cache. This implementation uses readahead instead
>> of extending the guest request, so the original request goes down
>> unchanged. There is no reason to put data into the cache that will
>> never be picked up, but that would always happen in the case of
>> extended requests. The areas of cached blocks are likewise stored in
>> an rb-tree (pcache.tree), a simple but, for this purpose, quite
>> efficient data structure.
>>
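A minimal sketch of the pcache.tree bookkeeping, under the same simplification (an array instead of an rb-tree; `cache_insert` and `cache_contains` are illustrative names): a readahead inserts the fetched area, and later guest reads in the stream are answered from memory if their range is covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 32

/* one cached area brought in by a readahead */
typedef struct CacheNode { uint64_t offset, bytes; } CacheNode;

/* Stand-in for pcache.tree: the set of cached areas (the real driver
 * keeps them in an rb-tree keyed by offset). */
typedef struct PCacheTree {
    CacheNode nodes[MAX_NODES];
    size_t    count;
} PCacheTree;

static void cache_insert(PCacheTree *t, uint64_t offset, uint64_t bytes)
{
    t->nodes[t->count++] = (CacheNode){ offset, bytes };
}

/* true if [offset, offset+bytes) is fully covered by one cached node,
 * i.e. the guest read can be served from memory */
static bool cache_contains(const PCacheTree *t, uint64_t offset,
                           uint64_t bytes)
{
    for (size_t i = 0; i < t->count; i++) {
        if (offset >= t->nodes[i].offset &&
            offset + bytes <= t->nodes[i].offset + t->nodes[i].bytes) {
            return true;
        }
    }
    return false;
}
```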
>> 4. Control the size of the prefetch cache pool and the request
>> statistics pool.
>> To bound the request statistics pool, request data is placed and
>> replaced according to the FIFO principle; everything is simple there.
>> To bound the cache memory, an LRU list is used, which limits the
>> maximum amount of memory that can be allocated for pcache. But the
>> LRU is there mainly to prevent eviction of cache blocks that were
>> read only partially. The main path frees memory immediately after
>> use: as soon as a chunk of cache memory has been completely read, it
>> is dropped, since the probability of the same request repeating is
>> very low. Cases where one and the same portion of cache memory is
>> read several times are not optimised and do not belong to the cases
>> that pcache can optimise. Thus, using a cache memory of small volume,
>> by optimising the readahead and memory-reclaim operations, entire
>> volumes of data can be read with a 100% cache hit rate, without
>> decreasing the effectiveness of random read requests.
>>
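The immediate-drop policy can be sketched as follows. `PNode` and `PCacheMem` are illustrative stand-ins for the driver's structures, and the LRU list for partially-read chunks is left out of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* one cached chunk plus how much of it readers have consumed */
typedef struct PNode {
    uint64_t bytes;       /* chunk size */
    uint64_t read_bytes;  /* bytes already served to the guest */
    bool     present;     /* still held in cache memory */
} PNode;

typedef struct PCacheMem {
    uint64_t used;        /* bytes currently held in cache memory */
} PCacheMem;

static void node_create(PCacheMem *c, PNode *n, uint64_t bytes)
{
    *n = (PNode){ .bytes = bytes, .present = true };
    c->used += bytes;
}

/* Account a read of `bytes` from node `n`.  Once the chunk has been
 * completely read, drop it immediately: a repeat read of the same data
 * is unlikely, so the memory is better spent on new readahead.  The
 * LRU list (not shown) only has to evict chunks read partially. */
static void node_account_read(PCacheMem *c, PNode *n, uint64_t bytes)
{
    n->read_bytes += bytes;
    if (n->present && n->read_bytes >= n->bytes) {
        c->used -= n->bytes;
        n->present = false;
    }
}
```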
>> PCache is implemented as a QEMU block filter driver and has some
>> configurable parameters, such as: total cache size, readahead size,
>> and the maximum size of a request that can be processed.
>>
>> For performance evaluation, several test cases with different
>> sequential and random read patterns on an SSD disk were used. Here
>> are the test results and the qemu parameters:
>>
>> qemu parameters:
>> -M pc-i440fx-2.4 --enable-kvm -smp 4 -m 1024
>> -drive
>> file=centos7.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,
>>         aio=native,pcache-full-size=4MB,pcache-readahead-size=128KB,
>>         pcache-max-aio-size=32KB
>> -device
>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk0,
>>          id=virtio-disk0
>> (-set device.virtio-disk0.x-data-plane=on)
>>
>> ********************************************************************************
>> * Testcase                        * Results in iops                           *
>> *                                 **********************************************
>> *                                 * clean qemu   * pcache       * x-data-plane *
>> ********************************************************************************
>> * Create/open 16 file(s) of total * 25514 req/s  * 85659 req/s  * 28249 req/s  *
>> * size 2048.00 MB named           * 25692 req/s  * 89064 req/s  * 27950 req/s  *
>> * /tmp/tmp.tmp, start 4 thread(s) * 25836 req/s  * 84142 req/s  * 28120 req/s  *
>> * and do uncached sequential read *              *              *              *
>> * by 4KB blocks                   *              *              *              *
>> ********************************************************************************
>> * Create/open 16 file(s) of total * 56006 req/s  * 92137 req/s  * 56992 req/s  *
>> * size 2048.00 MB named           * 55335 req/s  * 92269 req/s  * 57023 req/s  *
>> * /tmp/tmp.tmp, start 4 thread(s) * 55731 req/s  * 98722 req/s  * 56593 req/s  *
>> * and do uncached sequential read *              *              *              *
>> * by 4KB blocks with constant     *              *              *              *
>> ********************************************************************************
>> * Create/open 16 file(s) of total * 14104 req/s  * 14164 req/s  * 13914 req/s  *
>> * size 2048.00 MB named           * 14130 req/s  * 14232 req/s  * 13613 req/s  *
>> * /tmp/tmp.tmp, start 4 thread(s) * 14183 req/s  * 14080 req/s  * 13374 req/s  *
>> * and do uncached random read by  *              *              *              *
>> * 4KB blocks                      *              *              *              *
>> ********************************************************************************
>> * Create/open 16 file(s) of total * 23480 req/s  * 23483 req/s  * 20887 req/s  *
>> * size 2048.00 MB named           * 23070 req/s  * 22432 req/s  * 21127 req/s  *
>> * /tmp/tmp.tmp, start 4 thread(s) * 24090 req/s  * 23499 req/s  * 23415 req/s  *
>> * and do uncached random read by  *              *              *              *
>> * 4KB blocks with constant queue  *              *              *              *
>> * len 32                          *              *              *              *
>> ********************************************************************************
>>
>
>
> I note, in your tests, you use uncached sequential reads.  But are
> uncached sequential reads with a small block size common?
>
> Consider the case of cached sequential reads.  Here, the guest OS will
> issue read-aheads.  pcache will detect them and issue its own
> read-aheads, both layers will read ahead more than necessary, so pcache
> is adding extra I/O and memory copies here.
>
Yes, a guest can have its own read-ahead cache, but in this case pcache
does not lead to excessive activity, because the first guest read-ahead
request hits pcache memory, and the next read-ahead requests are
filtered out on the pcache side. This holds only for windows of the
same size; with a different window size, a concurrent read-ahead
request will never happen. And even if a simultaneous read-ahead
request could lead to extra I/O, that would only be a problem of the
pcache implementation.

> So I'm wondering about the use case.  Guest userspace applications which
> do uncached reads will typically manage their own read-ahead; and cached
> reads have the kernel reading ahead for them, with the benefit of
> knowing the file layout.  That leaves dd iflag=direct, but is it such an
> important application?
>
It helps with live loads on Windows. A simple example: a Windows boot
(win8.1, 1024 MB RAM), even with the Windows Prefetcher enabled, reads
about 300MB from pcache memory. It should be understood that pcache is
designed to optimize the guest's behaviour as a whole, not any
particular application inside it. Guest read-ahead is tied to a file
descriptor and aimed at optimizing a userspace application, whereas
pcache sits several levels above that, which allows us to cover other
cases. Another example is walking a directory tree: the effect appears
because, when traversing a directory tree, there is a good chance that
some fs blocks are placed sequentially. But in general, pcache helps
to reduce latency under high load for Windows VMs.

>> TODO list:
>> - add tracepoints
>> - add migration support
>> - add more explanations in the commit messages
>> - get rid of the additional allocation in pcache_node_find_and_create()
>>   and pcache_aio_readv()
>>
>> Changes from v1:
>> - Fix failed automatic build test (11)
>>
>> Pavel Butsykin (22):
>>    block/pcache: empty pcache driver filter
>>    block/pcache: add own AIOCB block
>>    util/rbtree: add rbtree from linux kernel
>>    block/pcache: add pcache debug build
>>    block/pcache: add aio requests into cache
>>    block/pcache: restrict cache size
>>    block/pcache: introduce LRU as method of memory
>>    block/pcache: implement pickup parts of the cache
>>    block/pcache: separation AIOCB on requests
>>    block/pcache: add check node leak
>>    add QEMU style defines for __sync_add_and_fetch
>>    block/pcache: implement read cache to qiov and drop node during aio
>>      write
>>    block/pcache: add generic request complete
>>    block/pcache: add support for rescheduling requests
>>    block/pcache: simple readahead one chunk forward
>>    block/pcache: pcache readahead node around
>>    block/pcache: skip readahead for non-sequential requests
>>    block/pcache: add pcache skip large aio read
>>    block/pcache: add pcache node assert
>>    block/pcache: implement pcache error handling of aio cb
>>    block/pcache: add write through node
>>    block/pcache: drop used pcache node
>>
>>   block/Makefile.objs             |    1 +
>>   block/pcache.c                  | 1224
>> +++++++++++++++++++++++++++++++++++++++
>>   include/qemu/atomic.h           |    8 +
>>   include/qemu/rbtree.h           |  109 ++++
>>   include/qemu/rbtree_augmented.h |  237 ++++++++
>>   util/Makefile.objs              |    1 +
>>   util/rbtree.c                   |  570 ++++++++++++++++++
>>   7 files changed, 2150 insertions(+)
>>   create mode 100644 block/pcache.c
>>   create mode 100644 include/qemu/rbtree.h
>>   create mode 100644 include/qemu/rbtree_augmented.h
>>   create mode 100644 util/rbtree.c
>>
>

