From: Pavel Butsykin <pbutsykin@virtuozzo.com>
To: Avi Kivity <avi@scylladb.com>,
	qemu-block@nongnu.org, qemu-devel@nongnu.org
Cc: kwolf@redhat.com, famz@redhat.com, mreitz@redhat.com,
	stefanha@redhat.com, den@openvz.org, jsnow@redhat.com
Subject: Re: [Qemu-devel] [PATCH RFC v2 00/22] I/O prefetch cache
Date: Tue, 6 Sep 2016 15:40:55 +0300
Message-ID: <57CEB957.5050009@virtuozzo.com>
In-Reply-To: <83595cde-6b37-20c2-a37d-e6b030a005a6@scylladb.com>

On 01.09.2016 18:26, Avi Kivity wrote:
> On 08/29/2016 08:09 PM, Pavel Butsykin wrote:
>> The prefetch cache aims to improve the performance of sequential data
>> reads. Of most interest here are small sequential read requests; such
>> requests can be optimized by extending them and moving the data into
>> the prefetch cache. However, there are 2 issues:
>>   - In aggregate, only a small portion of requests is sequential, so
>>     delays caused by the need to read larger volumes of data would
>>     lead to an overall decrease in performance.
>>   - With a large number of random requests, redundant data would be
>>     kept in the cache memory.
>> This pcache implementation solves the above and other problems of
>> prefetching data. The pcache algorithm can be summarised by the
>> following main steps.
>>
>> 1. Monitor I/O requests to identify typical sequences.
>> This prefetch cache implementation works at the storage system level
>> and has information only about the physical block addresses of I/O
>> requests. Statistics are collected only from read requests up to a
>> maximum size of 32KB (by default); each request that matches the
>> criteria falls into a pool of requests. The request statistics are
>> stored in an rb-tree (lreq.tree), a simple but, for this purpose,
>> quite efficient data structure.
>>
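The bookkeeping of this step can be sketched roughly as below. This is only an illustration: a plain sorted array stands in for the lreq.tree rb-tree, and all names and the `POOL_CAP` limit are assumptions of the sketch, not the driver's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCACHE_MAX_AIO_SIZE (32 * 1024)  /* default request-size cutoff */
#define POOL_CAP 64                      /* illustrative pool bound */

typedef struct ReqStat {
    uint64_t offset;   /* byte offset of the read */
    uint64_t bytes;    /* request length */
} ReqStat;

/* Stand-in for lreq.tree: a sorted array keyed by offset.  The real
 * driver keeps these entries in an rb-tree. */
typedef struct ReqPool {
    ReqStat reqs[POOL_CAP];
    size_t  count;
} ReqPool;

/* Record a read request in the statistics pool, skipping requests
 * larger than the cutoff.  Returns true if the request was recorded. */
static bool pool_track_read(ReqPool *pool, uint64_t offset, uint64_t bytes)
{
    if (bytes > PCACHE_MAX_AIO_SIZE || pool->count == POOL_CAP) {
        return false;
    }
    size_t i = pool->count;
    while (i > 0 && pool->reqs[i - 1].offset > offset) {
        pool->reqs[i] = pool->reqs[i - 1];   /* keep sorted by offset */
        i--;
    }
    pool->reqs[i] = (ReqStat){ .offset = offset, .bytes = bytes };
    pool->count++;
    return true;
}
```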
>> 2. Identify sequential I/O streams.
>> For each read request, an attempt is made to pick up from lreq.tree the
>> chain of sequential requests of which this request would be an element.
>> The key used to search for consecutive requests is the area of sectors
>> immediately preceding the current request. This area should not be too
>> small, to avoid false readahead. Sequential streams can be identified
>> even among a large number of random requests. For example, if there are
>> accesses to blocks 100, 1157, 27520, 4, 101, 312, 1337, 102, then while
>> processing request 102 the chain of sequential requests 100, 101, 102
>> will be identified, and at that point a readahead decision can be made.
>> A situation may also arise where multiple applications A, B, C perform
>> sequential reads simultaneously. Taken separately, each application
>> reads data sequentially: A(100, 101, 102), B(300, 301, 302),
>> C(700, 701, 702); but to the block device it may look like random
>> reading: 100, 300, 700, 101, 301, 701, 102, 302, 702. In this case the
>> sequential streams will still be recognised, because the placement of
>> the requests in the rb-tree makes it possible to separate the
>> sequential I/O streams.
>>
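The chain lookup described above can be sketched like this. Again a simplification: a linear scan over an array stands in for the rb-tree search, and `SECTOR`, `pool_add` and `chain_len` are illustrative names, not the driver's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR 512

typedef struct ReqStat { uint64_t offset, bytes; } ReqStat;

/* Stand-in for lreq.tree: the pool of recorded small reads.  The real
 * driver keeps them in an rb-tree keyed by offset. */
typedef struct ReqPool { ReqStat reqs[64]; size_t count; } ReqPool;

static void pool_add(ReqPool *p, uint64_t off, uint64_t bytes)
{
    p->reqs[p->count++] = (ReqStat){ off, bytes };
}

/* Walk backwards from the request starting at `offset`: each step looks
 * for a recorded request that ends exactly where the chain currently
 * begins.  Returns the chain length including the current request. */
static size_t chain_len(const ReqPool *p, uint64_t offset)
{
    size_t len = 1;
    uint64_t want_end = offset;          /* predecessor must end here */
    bool found = true;
    while (found) {
        found = false;
        for (size_t i = 0; i < p->count; i++) {
            if (p->reqs[i].offset + p->reqs[i].bytes == want_end) {
                want_end = p->reqs[i].offset;
                len++;
                found = true;
                break;
            }
        }
    }
    return len;
}
```

With the access pattern from the example above, processing block 102 finds the three-element chain 100, 101, 102, while the random requests do not extend any chain.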
>> 3. Do readahead into the cache for recognised sequential data streams.
>> Once the detection problem is solved, pcache needs larger requests to
>> bring data into the cache. This implementation uses readahead instead
>> of extending the guest request, so the original request goes down
>> unchanged. There is no reason to put data into the cache that will
>> never be picked up, but that would always happen in the case of
>> extended requests. The areas of cached blocks are likewise stored in
>> an rb-tree (pcache.tree), a simple but, for this purpose, quite
>> efficient data structure.
>>
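A minimal sketch of the pcache.tree bookkeeping, under the same simplification (an array instead of an rb-tree; `cache_insert` and `cache_contains` are illustrative names): a readahead inserts the fetched area, and later guest reads in the stream are answered from memory if their range is covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 32

/* one cached area brought in by a readahead */
typedef struct CacheNode { uint64_t offset, bytes; } CacheNode;

/* Stand-in for pcache.tree: the set of cached areas (the real driver
 * keeps them in an rb-tree keyed by offset). */
typedef struct PCacheTree {
    CacheNode nodes[MAX_NODES];
    size_t    count;
} PCacheTree;

static void cache_insert(PCacheTree *t, uint64_t offset, uint64_t bytes)
{
    t->nodes[t->count++] = (CacheNode){ offset, bytes };
}

/* true if [offset, offset+bytes) is fully covered by one cached node,
 * i.e. the guest read can be served from memory */
static bool cache_contains(const PCacheTree *t, uint64_t offset,
                           uint64_t bytes)
{
    for (size_t i = 0; i < t->count; i++) {
        if (offset >= t->nodes[i].offset &&
            offset + bytes <= t->nodes[i].offset + t->nodes[i].bytes) {
            return true;
        }
    }
    return false;
}
```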
>> 4. Control the size of the prefetch cache pool and the request
>> statistics pool.
>> To bound the request statistics pool, request data is placed and
>> replaced according to the FIFO principle; everything is simple there.
>> To bound the cache memory, an LRU list is used, which limits the
>> maximum amount of memory that can be allocated for pcache. But the
>> LRU is there mainly to prevent eviction of cache blocks that were
>> read only partially. The main path frees memory immediately after
>> use: as soon as a chunk of cache memory has been completely read, it
>> is dropped, since the probability of the same request repeating is
>> very low. Cases where one and the same portion of cache memory is
>> read several times are not optimised and do not belong to the cases
>> that pcache can optimise. Thus, using a cache memory of small volume,
>> by optimising the readahead and memory-reclaim operations, entire
>> volumes of data can be read with a 100% cache hit rate, without
>> decreasing the effectiveness of random read requests.
>>
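The immediate-drop policy can be sketched as follows. `PNode` and `PCacheMem` are illustrative stand-ins for the driver's structures, and the LRU list for partially-read chunks is left out of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* one cached chunk plus how much of it readers have consumed */
typedef struct PNode {
    uint64_t bytes;       /* chunk size */
    uint64_t read_bytes;  /* bytes already served to the guest */
    bool     present;     /* still held in cache memory */
} PNode;

typedef struct PCacheMem {
    uint64_t used;        /* bytes currently held in cache memory */
} PCacheMem;

static void node_create(PCacheMem *c, PNode *n, uint64_t bytes)
{
    *n = (PNode){ .bytes = bytes, .present = true };
    c->used += bytes;
}

/* Account a read of `bytes` from node `n`.  Once the chunk has been
 * completely read, drop it immediately: a repeat read of the same data
 * is unlikely, so the memory is better spent on new readahead.  The
 * LRU list (not shown) only has to evict chunks read partially. */
static void node_account_read(PCacheMem *c, PNode *n, uint64_t bytes)
{
    n->read_bytes += bytes;
    if (n->present && n->read_bytes >= n->bytes) {
        c->used -= n->bytes;
        n->present = false;
    }
}
```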
>> PCache is implemented as a QEMU block filter driver and has some
>> configurable parameters, such as: total cache size, readahead size,
>> and the maximum size of a request that can be processed.
>>
>> For performance evaluation, several test cases with different
>> sequential and random read patterns on an SSD disk were used. Here
>> are the test results and the qemu parameters:
>>
>> qemu parameters:
>> -M pc-i440fx-2.4 --enable-kvm -smp 4 -m 1024
>> -drive
>> file=centos7.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,
>>         aio=native,pcache-full-size=4MB,pcache-readahead-size=128KB,
>>         pcache-max-aio-size=32KB
>> -device
>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk0,
>>          id=virtio-disk0
>> (-set device.virtio-disk0.x-data-plane=on)
>>
>> ********************************************************************************
>> * Testcase                        * Results in iops                           *
>> *                                 **********************************************
>> *                                 * clean qemu   * pcache       * x-data-plane *
>> ********************************************************************************
>> * Create/open 16 file(s) of total * 25514 req/s  * 85659 req/s  * 28249 req/s  *
>> * size 2048.00 MB named           * 25692 req/s  * 89064 req/s  * 27950 req/s  *
>> * /tmp/tmp.tmp, start 4 thread(s) * 25836 req/s  * 84142 req/s  * 28120 req/s  *
>> * and do uncached sequential read *              *              *              *
>> * by 4KB blocks                   *              *              *              *
>> ********************************************************************************
>> * Create/open 16 file(s) of total * 56006 req/s  * 92137 req/s  * 56992 req/s  *
>> * size 2048.00 MB named           * 55335 req/s  * 92269 req/s  * 57023 req/s  *
>> * /tmp/tmp.tmp, start 4 thread(s) * 55731 req/s  * 98722 req/s  * 56593 req/s  *
>> * and do uncached sequential read *              *              *              *
>> * by 4KB blocks with constant     *              *              *              *
>> ********************************************************************************
>> * Create/open 16 file(s) of total * 14104 req/s  * 14164 req/s  * 13914 req/s  *
>> * size 2048.00 MB named           * 14130 req/s  * 14232 req/s  * 13613 req/s  *
>> * /tmp/tmp.tmp, start 4 thread(s) * 14183 req/s  * 14080 req/s  * 13374 req/s  *
>> * and do uncached random read by  *              *              *              *
>> * 4KB blocks                      *              *              *              *
>> ********************************************************************************
>> * Create/open 16 file(s) of total * 23480 req/s  * 23483 req/s  * 20887 req/s  *
>> * size 2048.00 MB named           * 23070 req/s  * 22432 req/s  * 21127 req/s  *
>> * /tmp/tmp.tmp, start 4 thread(s) * 24090 req/s  * 23499 req/s  * 23415 req/s  *
>> * and do uncached random read by  *              *              *              *
>> * 4KB blocks with constant queue  *              *              *              *
>> * len 32                          *              *              *              *
>> ********************************************************************************
>>
>
>
> I note, in your tests, you use uncached sequential reads.  But are
> uncached sequential reads with a small block size common?
>
> Consider the case of cached sequential reads.  Here, the guest OS will
> issue read-aheads.  pcache will detect them and issue its own
> read-aheads, both layers will read ahead more than necessary, so pcache
> is adding extra I/O and memory copies here.
>
Yes, a guest can have its own read-ahead cache, but in this case pcache
does not lead to excessive activity, because the first guest read-ahead
request hits pcache memory, and the next read-ahead requests are
filtered out on the pcache side. This holds only for windows of the
same size; with a different window size, a concurrent read-ahead
request will never happen. And even if a simultaneous read-ahead
request could lead to extra I/O, that would only be a problem of the
pcache implementation.

> So I'm wondering about the use case.  Guest userspace applications which
> do uncached reads will typically manage their own read-ahead; and cached
> reads have the kernel reading ahead for them, with the benefit of
> knowing the file layout.  That leaves dd iflag=direct, but is it such an
> important application?
>
It helps with live loads on Windows. A simple example: a Windows boot
(win8.1, 1024 MB RAM), even with the Windows Prefetcher enabled, reads
about 300MB from pcache memory. It should be understood that pcache is
designed to optimize the guest's behaviour as a whole, not any
particular application inside it. Guest read-ahead is tied to a file
descriptor and aimed at optimizing a userspace application, whereas
pcache sits several levels above that, which allows us to cover other
cases. Another example is walking a directory tree: the effect appears
because, when traversing a directory tree, there is a good chance that
some fs blocks are placed sequentially. But in general, pcache helps
to reduce latency under high load for Windows VMs.

>> TODO list:
>> - add tracepoints
>> - add migration support
>> - add more explanations in the commit messages
>> - get rid of the additional allocation in pcache_node_find_and_create()
>>   and pcache_aio_readv()
>>
>> Changes from v1:
>> - Fix failed automatic build test (11)
>>
>> Pavel Butsykin (22):
>>    block/pcache: empty pcache driver filter
>>    block/pcache: add own AIOCB block
>>    util/rbtree: add rbtree from linux kernel
>>    block/pcache: add pcache debug build
>>    block/pcache: add aio requests into cache
>>    block/pcache: restrict cache size
>>    block/pcache: introduce LRU as method of memory
>>    block/pcache: implement pickup parts of the cache
>>    block/pcache: separation AIOCB on requests
>>    block/pcache: add check node leak
>>    add QEMU style defines for __sync_add_and_fetch
>>    block/pcache: implement read cache to qiov and drop node during aio
>>      write
>>    block/pcache: add generic request complete
>>    block/pcache: add support for rescheduling requests
>>    block/pcache: simple readahead one chunk forward
>>    block/pcache: pcache readahead node around
>>    block/pcache: skip readahead for non-sequential requests
>>    block/pcache: add pcache skip large aio read
>>    block/pcache: add pcache node assert
>>    block/pcache: implement pcache error handling of aio cb
>>    block/pcache: add write through node
>>    block/pcache: drop used pcache node
>>
>>   block/Makefile.objs             |    1 +
>>   block/pcache.c                  | 1224
>> +++++++++++++++++++++++++++++++++++++++
>>   include/qemu/atomic.h           |    8 +
>>   include/qemu/rbtree.h           |  109 ++++
>>   include/qemu/rbtree_augmented.h |  237 ++++++++
>>   util/Makefile.objs              |    1 +
>>   util/rbtree.c                   |  570 ++++++++++++++++++
>>   7 files changed, 2150 insertions(+)
>>   create mode 100644 block/pcache.c
>>   create mode 100644 include/qemu/rbtree.h
>>   create mode 100644 include/qemu/rbtree_augmented.h
>>   create mode 100644 util/rbtree.c
>>
>

