* [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache
@ 2016-11-15  6:36 Pavel Butsykin
  2016-11-15  6:36 ` [Qemu-devel] [PATCH v1 01/18] block/io: add bdrv_aio_{preadv, pwritev} Pavel Butsykin
                   ` (18 more replies)
  0 siblings, 19 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:36 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

The prefetch cache aims to improve the performance of sequential reads. Of most
interest here are small sequential read requests, which can be optimized by
extending them and moving the extra data into the prefetch cache. However,
there are two issues:
 - In aggregate, only a small portion of requests is sequential, so the delays
   caused by reading larger volumes of data can lead to an overall decrease
   in performance.
 - With a large number of random requests, the cache memory fills up with
   redundant data.
This pcache implementation solves these and other problems of prefetching data.
The pcache algorithm can be summarised by the following main steps.

1. Monitor I/O requests to identify typical sequences.
This implementation of the prefetch cache works at the storage system level and
has information only about the physical block addresses of I/O requests.
Statistics are collected only from read requests up to a maximum size of 64KB
(by default); each request that matches the criteria falls into a pool of
requests. An rb-tree is used to store the request statistics: a simple but,
for this purpose, quite efficient data structure.
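
As an illustration, here is a hypothetical sketch of this filtering step, in
terms of the RBCache API added later in this series (PCACHE_MAX_STAT_REQ_SIZE
and update_req_stats are placeholder names, not the actual implementation):

#define PCACHE_MAX_STAT_REQ_SIZE (64 << 10)   /* default 64KB threshold */

static void update_req_stats(RBCache *req_stats, uint64_t offset,
                             uint64_t bytes)
{
    if (bytes <= PCACHE_MAX_STAT_REQ_SIZE) {
        /* record the range [offset, offset + bytes) in the request pool;
         * if an overlapping node already exists, it is returned instead */
        rbcache_search_and_insert(req_stats, offset, bytes);
    }
}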

2. Identify sequential I/O streams.
For each read request, an attempt is made to lift from the statistics tree a
chain of sequential requests of which this request would be the next element.
The key used to search for consecutive requests is the area of bytes preceding
the current request. The size of this area should not be too small, to avoid
false readahead. Sequential streams of requests can be identified even among a
large number of random requests. For example, if blocks 100, 1157, 27520, 4,
101, 312, 1337, 102 are accessed, then while processing request 102 the chain
of sequential requests 100, 101, 102 will be identified, and a decision to do
readahead can then be made. A situation may also arise where multiple
applications A, B, C simultaneously perform sequential reads. Each application
by itself reads sequentially (A(100, 101, 102), B(300, 301, 302),
C(700, 701, 702)), but to the block device it may look like random reading:
100, 300, 700, 101, 301, 701, 102, 302, 702. In this case the sequential
streams will still be recognised, because the placement of the requests in the
rb-tree makes it possible to separate the sequential I/O streams.
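
A hypothetical sketch of the lookback check described above
(PCACHE_LOOKBACK_SIZE and request_is_sequential are illustrative names only):

#define PCACHE_LOOKBACK_SIZE (32 << 10)   /* illustrative threshold */

static bool request_is_sequential(RBCache *req_stats, uint64_t offset)
{
    uint64_t lookback = MIN(offset, PCACHE_LOOKBACK_SIZE);

    /* a hit means that earlier reads already covered the area just
     * before this request, i.e. we are inside a sequential stream */
    return lookback != 0 &&
           rbcache_search(req_stats, offset - lookback, lookback) != NULL;
}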

3. Do readahead into the cache for recognized sequential data streams.
Once a sequential stream has been detected, pcache needs to use larger requests
to bring the data into the cache. This implementation uses readahead instead of
extending the request, so the original request goes through as is. There is no
reason to put data into the cache that will never be picked up, but that is
exactly what would happen with extended requests. The areas of cached blocks
are also stored in an rb-tree: again, a simple but, for this purpose, quite
efficient data structure.
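
A hypothetical sketch of the readahead path (pcache_start_readahead and
readahead_size are illustrative; AIO submission and buffer management are
omitted):

static void pcache_start_readahead(RBCache *cache, uint64_t offset,
                                   uint64_t bytes, uint64_t readahead_size)
{
    uint64_t ra_offset = offset + bytes;
    RBCacheNode *node;

    /* the guest request itself is submitted unchanged elsewhere */
    node = rbcache_search_and_insert(cache, ra_offset, readahead_size);
    if (node->offset == ra_offset && node->bytes == readahead_size) {
        /* a freshly inserted node: issue a background AIO read of
         * readahead_size bytes into the node's buffer here */
    }
}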

4. Control the size of the prefetch cache pool and the request statistics pool.
The boundary of the request statistics pool is controlled simply: request data
is placed and replaced according to the FIFO principle. The boundary of the
cache memory is controlled by an LRU list, which limits the maximum amount of
memory that pcache can allocate. But the LRU is there mainly to prevent
eviction of cache blocks that have been read only partially. The main path is
that memory is released immediately after use: as soon as a chunk of cache
memory has been read completely, it is dropped, because the probability of the
same request being repeated is very low. Cases where one and the same portion
of cache memory is read several times are not optimized; they do not belong to
the cases that pcache can optimize. Thus, by optimizing the readahead and
memory-release operations, a small amount of cache memory is enough to read
entire volumes of data with a 100% cache hit rate, without decreasing the
performance of random read requests.
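
A hypothetical sketch of the drop-after-full-read policy (PCacheNodeSketch and
its fields are illustrative, not the actual pcache node layout):

typedef struct PCacheNodeSketch {
    RBCacheNode common;
    uint64_t bytes_consumed;   /* bytes already copied out to guest reads */
} PCacheNodeSketch;

static void node_consumed(RBCache *cache, PCacheNodeSketch *node,
                          uint64_t bytes)
{
    node->bytes_consumed += bytes;
    if (node->bytes_consumed >= node->common.bytes) {
        /* the chunk has been read completely; a repeated read is very
         * unlikely, so free the memory at once rather than waiting for
         * LRU pressure */
        rbcache_remove(cache, &node->common);
    }
}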

PCache is implemented as a QEMU block filter driver and has several
configurable parameters: total cache size, statistics size, readahead size,
and the maximum size of a block that can be processed.

Several test cases with different sequential and random read patterns on an
SSD disk were used for the performance evaluation. Here are the test results
and the QEMU parameters:

qemu parameters: 
-machine pc,accel=kvm,usb=off,vmport=off -m 1024 -smp 8
-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4
-drive file=/img/harddisk.hdd,format=pcache,if=none,id=drive-scsi0-0-0-0,cache=none,aio=native 
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0

*****************************************************************
* Testcase                        * Results (req/s)             *
*                                 *******************************
*                                 *  clean qemu  *    pcache    *
*****************************************************************
* Create/open 16 file(s) of total * 21645 req/s  * 74793 req/s  *
* size 2048.00 MB named           * 20385 req/s  * 66481 req/s  *
* /tmp/tmp.tmp, start 4 thread(s) * 20616 req/s  * 69007 req/s  *
* and do uncached sequential read *              *              *
* by 4KB blocks                   *              *              *
*****************************************************************
* Create/open 16 file(s) of total * 84033 req/s  * 87828 req/s  *
* size 2048.00 MB named           * 84602 req/s  * 89678 req/s  *
* /tmp/tmp.tmp, start 4 thread(s) * 83163 req/s  * 96202 req/s  *
* and do uncached sequential read *              *              *
* by 4KB blocks with constant     *              *              *
* queue len 32                    *              *              *
*****************************************************************
* Create/open 16 file(s) of total * 14104 req/s  * 14164 req/s  *
* size 2048.00 MB named           * 14130 req/s  * 14232 req/s  *
* /tmp/tmp.tmp, start 4 thread(s) * 14183 req/s  * 14080 req/s  *
* and do uncached random read by  *              *              *
* 4KB blocks                      *              *              *
*****************************************************************
* Create/open 16 file(s) of total * 23480 req/s  * 23483 req/s  *
* size 2048.00 MB named           * 23070 req/s  * 22432 req/s  *
* /tmp/tmp.tmp, start 4 thread(s) * 24090 req/s  * 23499 req/s  *
* and do uncached random read by  *              *              *
* 4KB blocks with constant queue  *              *              *
* len 32                          *              *              *
*****************************************************************

Pavel Butsykin (18):
  block/io: add bdrv_aio_{preadv, pwritev}
  block/pcache: empty pcache driver filter
  util/rbtree: add rbtree from linux kernel
  util/rbcache: range-based cache core
  tests/test-rbcache: add test cases
  block/pcache: statistics collection read requests
  block/pcache: skip large aio read
  block/pcache: updating statistics for overlapping requests
  block/pcache: add AIO readahead
  block/pcache: skip readahead for unallocated clusters
  block/pcache: cache invalidation on AIO write requests
  block/pcache: add reading data from the cache
  block/pcache: inflight readahead request waiting for aio read
  block/pcache: pick up parts of the cache
  block/pcache: drop used pcache nodes
  block/pcache: write through
  block/pcache: add tracepoints
  block/pcache: debug build

 MAINTAINERS                     |  13 +
 block/Makefile.objs             |   1 +
 block/io.c                      |  16 +
 block/pcache.c                  | 771 ++++++++++++++++++++++++++++++++++++++++
 block/trace-events              |   9 +
 include/block/block.h           |   7 +
 include/qemu/rbcache.h          | 105 ++++++
 include/qemu/rbtree.h           | 107 ++++++
 include/qemu/rbtree_augmented.h | 235 ++++++++++++
 tests/Makefile.include          |   3 +
 tests/test-rbcache.c            | 308 ++++++++++++++++
 util/Makefile.objs              |   2 +
 util/rbcache.c                  | 246 +++++++++++++
 util/rbtree.c                   | 570 +++++++++++++++++++++++++++++
 14 files changed, 2393 insertions(+)
 create mode 100644 block/pcache.c
 create mode 100644 include/qemu/rbcache.h
 create mode 100644 include/qemu/rbtree.h
 create mode 100644 include/qemu/rbtree_augmented.h
 create mode 100644 tests/test-rbcache.c
 create mode 100644 util/rbcache.c
 create mode 100644 util/rbtree.c

-- 
2.10.1

* [Qemu-devel] [PATCH v1 01/18] block/io: add bdrv_aio_{preadv, pwritev}
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
@ 2016-11-15  6:36 ` Pavel Butsykin
  2016-11-23 14:28   ` Kevin Wolf
  2016-11-15  6:36 ` [Qemu-devel] [PATCH v1 02/18] block/pcache: empty pcache driver filter Pavel Butsykin
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:36 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

These are just wrappers over bdrv_co_aio_prw_vector() which provide a
byte-based interface for AIO read/write.
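
For illustration, a hypothetical caller might look like this (buf, bytes,
offset and my_cb are placeholder names):

static void my_cb(void *opaque, int ret)
{
    /* completion callback: ret is 0 on success, negative errno on error */
}

    ...
    QEMUIOVector qiov;
    struct iovec iov = { .iov_base = buf, .iov_len = bytes };

    qemu_iovec_init_external(&qiov, &iov, 1);
    bdrv_aio_preadv(bs->file, offset, &qiov, bytes, my_cb, NULL);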

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/io.c            | 16 ++++++++++++++++
 include/block/block.h |  7 +++++++
 2 files changed, 23 insertions(+)

diff --git a/block/io.c b/block/io.c
index fdf7080..099bddb 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1991,6 +1991,22 @@ int bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
 /**************************************************************/
 /* async I/Os */
 
+BlockAIOCB *bdrv_aio_preadv(BdrvChild *child, int64_t offset,
+                            QEMUIOVector *qiov, unsigned int bytes,
+                            BlockCompletionFunc *cb, void *opaque)
+{
+    assert(bytes == qiov->size);
+    return bdrv_co_aio_prw_vector(child, offset, qiov, 0, cb, opaque, false);
+}
+
+BlockAIOCB *bdrv_aio_pwritev(BdrvChild *child, int64_t offset,
+                             QEMUIOVector *qiov, unsigned int bytes,
+                             BlockCompletionFunc *cb, void *opaque)
+{
+    assert(bytes == qiov->size);
+    return bdrv_co_aio_prw_vector(child, offset, qiov, 0, cb, opaque, true);
+}
+
 BlockAIOCB *bdrv_aio_readv(BdrvChild *child, int64_t sector_num,
                            QEMUIOVector *qiov, int nb_sectors,
                            BlockCompletionFunc *cb, void *opaque)
diff --git a/include/block/block.h b/include/block/block.h
index e18233a..6728219 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -305,6 +305,13 @@ BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
                                         const char *node_name, Error **errp);
 
 /* async block I/O */
+
+BlockAIOCB *bdrv_aio_preadv(BdrvChild *child, int64_t offset,
+                            QEMUIOVector *qiov, unsigned int bytes,
+                            BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *bdrv_aio_pwritev(BdrvChild *child, int64_t offset,
+                             QEMUIOVector *qiov, unsigned int bytes,
+                             BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *bdrv_aio_readv(BdrvChild *child, int64_t sector_num,
                            QEMUIOVector *iov, int nb_sectors,
                            BlockCompletionFunc *cb, void *opaque);
-- 
2.10.1

* [Qemu-devel] [PATCH v1 02/18] block/pcache: empty pcache driver filter
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
  2016-11-15  6:36 ` [Qemu-devel] [PATCH v1 01/18] block/io: add bdrv_aio_{preadv, pwritev} Pavel Butsykin
@ 2016-11-15  6:36 ` Pavel Butsykin
  2016-11-23 15:15   ` Kevin Wolf
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 03/18] util/rbtree: add rbtree from linux kernel Pavel Butsykin
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:36 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

A basic skeleton of the pcache driver, to simplify the preparation of the rest
of the patch set.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/Makefile.objs |   1 +
 block/pcache.c      | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 145 insertions(+)
 create mode 100644 block/pcache.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 7d4031d..c60f680 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -4,6 +4,7 @@ block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-obj-y += qed-check.o
 block-obj-y += vhdx.o vhdx-endian.o vhdx-log.o
 block-obj-y += quorum.o
+block-obj-y += pcache.o
 block-obj-y += parallels.o blkdebug.o blkverify.o blkreplay.o
 block-obj-y += block-backend.o snapshot.o qapi.o
 block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
diff --git a/block/pcache.c b/block/pcache.c
new file mode 100644
index 0000000..59461df
--- /dev/null
+++ b/block/pcache.c
@@ -0,0 +1,144 @@
+/*
+ * Prefetch cache driver filter
+ *
+ * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
+ *
+ * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "block/block_int.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qstring.h"
+
+
+static QemuOptsList runtime_opts = {
+    .name = "pcache",
+    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
+    .desc = {
+        {
+            .name = "x-image",
+            .type = QEMU_OPT_STRING,
+            .help = "[internal use only, will be removed]",
+        },
+        { /* end of list */ }
+    },
+};
+
+typedef struct PCacheAIOCB {
+    Coroutine *co;
+    int ret;
+} PCacheAIOCB;
+
+static void pcache_aio_cb(void *opaque, int ret)
+{
+    PCacheAIOCB *acb = opaque;
+
+    acb->ret = ret;
+    qemu_coroutine_enter(acb->co);
+}
+
+static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
+                                         uint64_t bytes, QEMUIOVector *qiov,
+                                         int flags)
+{
+    PCacheAIOCB acb = {
+        .co = qemu_coroutine_self(),
+    };
+
+    bdrv_aio_preadv(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
+
+    qemu_coroutine_yield();
+
+    return acb.ret;
+}
+
+static coroutine_fn int pcache_co_pwritev(BlockDriverState *bs, uint64_t offset,
+                                          uint64_t bytes, QEMUIOVector *qiov,
+                                          int flags)
+{
+    PCacheAIOCB acb = {
+        .co = qemu_coroutine_self(),
+    };
+
+    bdrv_aio_pwritev(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
+
+    qemu_coroutine_yield();
+
+    return acb.ret;
+}
+
+static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
+                            Error **errp)
+{
+    QemuOpts *opts;
+    Error *local_err = NULL;
+    int ret = 0;
+
+    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    assert(bs->file == NULL);
+    bs->file = bdrv_open_child(qemu_opt_get(opts, "x-image"), options,
+                               "image", bs, &child_format, false, &local_err);
+    if (local_err) {
+        ret = -EINVAL;
+        error_propagate(errp, local_err);
+    }
+fail:
+    qemu_opts_del(opts);
+    return ret;
+}
+
+static void pcache_close(BlockDriverState *bs)
+{
+}
+
+static void pcache_parse_filename(const char *filename, QDict *options,
+                                  Error **errp)
+{
+    qdict_put(options, "x-image", qstring_from_str(filename));
+}
+
+static int64_t pcache_getlength(BlockDriverState *bs)
+{
+    return bdrv_getlength(bs->file->bs);
+}
+
+static bool pcache_recurse_is_first_non_filter(BlockDriverState *bs,
+                                               BlockDriverState *candidate)
+{
+    return bdrv_recurse_is_first_non_filter(bs->file->bs, candidate);
+}
+
+static BlockDriver bdrv_pcache = {
+    .format_name                        = "pcache",
+    .protocol_name                      = "pcache",
+    .instance_size                      = 0,
+
+    .bdrv_parse_filename                = pcache_parse_filename,
+    .bdrv_file_open                     = pcache_file_open,
+    .bdrv_close                         = pcache_close,
+    .bdrv_getlength                     = pcache_getlength,
+
+    .bdrv_co_preadv                     = pcache_co_preadv,
+    .bdrv_co_pwritev                    = pcache_co_pwritev,
+
+    .is_filter                          = true,
+    .bdrv_recurse_is_first_non_filter   = pcache_recurse_is_first_non_filter,
+};
+
+static void bdrv_cache_init(void)
+{
+    bdrv_register(&bdrv_pcache);
+}
+
+block_init(bdrv_cache_init);
-- 
2.10.1

* [Qemu-devel] [PATCH v1 03/18] util/rbtree: add rbtree from linux kernel
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
  2016-11-15  6:36 ` [Qemu-devel] [PATCH v1 01/18] block/io: add bdrv_aio_{preadv, pwritev} Pavel Butsykin
  2016-11-15  6:36 ` [Qemu-devel] [PATCH v1 02/18] block/pcache: empty pcache driver filter Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 04/18] util/rbcache: range-based cache core Pavel Butsykin
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

Why don't we use the rbtree from glib? Because we need a pointer to the parent
node. To store cached chunks in the rbtree optimally, we need to find the next
and previous nodes, and access to the parent node is very useful for an
efficient implementation of these functions. In this rbtree implementation
(unlike glib's), a node contains a pointer to its parent node. Moreover, this
rbtree gives an algorithm more flexibility, because to use these rbtrees you
have to implement your own insert and search cores. This avoids callbacks and
the dramatic performance drop they would cause.
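
For reference, here is a minimal sketch (not part of this patch) of the insert
and search cores a user is expected to write; MyNode and my_tree_* are
hypothetical names:

typedef struct MyNode {
    struct RbNode rb;
    uint64_t key;
} MyNode;

static MyNode *my_tree_search(struct RbRoot *root, uint64_t key)
{
    struct RbNode *n = root->rb_node;

    while (n) {
        MyNode *entry = RB_ENTRY(n, MyNode, rb);

        if (key < entry->key) {
            n = n->rb_left;
        } else if (key > entry->key) {
            n = n->rb_right;
        } else {
            return entry;
        }
    }
    return NULL;
}

static void my_tree_insert(struct RbRoot *root, MyNode *node)
{
    struct RbNode **p = &root->rb_node, *parent = NULL;

    while (*p) {
        parent = *p;
        if (node->key < RB_ENTRY(parent, MyNode, rb)->key) {
            p = &parent->rb_left;
        } else {
            p = &parent->rb_right;
        }
    }
    /* link the new node in place, then rebalance */
    rb_link_node(&node->rb, parent, p);
    rb_insert_color(&node->rb, root);
}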

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 MAINTAINERS                     |   7 +
 include/qemu/rbtree.h           | 107 ++++++++
 include/qemu/rbtree_augmented.h | 235 +++++++++++++++++
 util/Makefile.objs              |   1 +
 util/rbtree.c                   | 570 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 920 insertions(+)
 create mode 100644 include/qemu/rbtree.h
 create mode 100644 include/qemu/rbtree_augmented.h
 create mode 100644 util/rbtree.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 9b7e846..ddf797b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1358,6 +1358,13 @@ F: include/qemu/throttle.h
 F: util/throttle.c
 L: qemu-block@nongnu.org
 
+Red Black Trees
+M: Denis V. Lunev <den@openvz.org>
+S: Supported
+F: include/qemu/rbtree.h
+F: include/qemu/rbtree_augmented.h
+F: util/rbtree.c
+
 UUID
 M: Fam Zheng <famz@redhat.com>
 S: Supported
diff --git a/include/qemu/rbtree.h b/include/qemu/rbtree.h
new file mode 100644
index 0000000..d2e3fdd
--- /dev/null
+++ b/include/qemu/rbtree.h
@@ -0,0 +1,107 @@
+/*
+  Red Black Trees
+  (C) 1999  Andrea Arcangeli <andrea@suse.de>
+
+  This program is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program; if not, write to the Free Software
+  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+  include/qemu/rbtree.h
+
+  To use rbtrees you'll have to implement your own insert and search cores.
+  This avoids callbacks and the dramatic performance drop they would cause.
+  I know it's not the cleanest way, but in C (not in C++) this is how you
+  get both performance and genericity...
+*/
+
+#ifndef QEMU_RBTREE_H
+#define QEMU_RBTREE_H
+
+#include <unistd.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+struct RbNode {
+    uintptr_t __rb_parent_color;
+    struct RbNode *rb_right;
+    struct RbNode *rb_left;
+} __attribute__((aligned(sizeof(uintptr_t))));
+    /* The alignment might seem pointless, but allegedly CRIS needs it */
+
+struct RbRoot {
+    struct RbNode *rb_node;
+};
+
+
+#define RB_PARENT(r) ((struct RbNode *)((r)->__rb_parent_color & ~3))
+
+#define RB_ROOT (struct RbRoot) { NULL, }
+#define RB_ENTRY(ptr, type, member) container_of(ptr, type, member)
+
+#define RB_EMPTY_ROOT(root) ((root)->rb_node == NULL)
+
+/* 'empty' nodes are nodes that are known not to be inserted in an rbtree */
+#define RB_EMPTY_NODE(node)  \
+    ((node)->__rb_parent_color == (uintptr_t)(node))
+#define RB_CLEAR_NODE(node)  \
+    ((node)->__rb_parent_color = (uintptr_t)(node))
+
+
+extern void rb_insert_color(struct RbNode *, struct RbRoot *);
+extern void rb_erase(struct RbNode *, struct RbRoot *);
+
+
+/* Find logical next and previous nodes in a tree */
+extern struct RbNode *rb_next(const struct RbNode *);
+extern struct RbNode *rb_prev(const struct RbNode *);
+extern struct RbNode *rb_first(const struct RbRoot *);
+extern struct RbNode *rb_last(const struct RbRoot *);
+
+/* Postorder iteration - always visit the parent after its children */
+extern struct RbNode *rb_first_postorder(const struct RbRoot *);
+extern struct RbNode *rb_next_postorder(const struct RbNode *);
+
+/* Fast replacement of a single node without remove/rebalance/add/rebalance */
+extern void rb_replace_node(struct RbNode *victim, struct RbNode *new,
+                            struct RbRoot *root);
+
+static inline void rb_link_node(struct RbNode *node, struct RbNode *parent,
+                                struct RbNode **rb_link)
+{
+    node->__rb_parent_color = (uintptr_t)parent;
+    node->rb_left = node->rb_right = NULL;
+
+    *rb_link = node;
+}
+
+#define RB_ENTRY_SAFE(ptr, type, member)                 \
+    ({ typeof(ptr) ____ptr = (ptr);                      \
+       ____ptr ? RB_ENTRY(____ptr, type, member) : NULL; \
+    })
+
+/**
+ * RBTREE_POSTORDER_FOR_EACH_ENTRY_SAFE - iterate over an RbRoot in post
+ * order of given type, safe against removal of the RbNode entry
+ *
+ * @pos:   the 'type *' to use as a loop cursor.
+ * @n:     another 'type *' to use as temporary storage
+ * @root:  'struct RbRoot *' of the rbtree.
+ * @field: the name of the RbNode field within 'type'.
+ */
+#define RBTREE_POSTORDER_FOR_EACH_ENTRY_SAFE(pos, n, root, field)            \
+    for (pos = RB_ENTRY_SAFE(rb_first_postorder(root), typeof(*pos), field); \
+         pos && ({ n = RB_ENTRY_SAFE(rb_next_postorder(&pos->field),         \
+         typeof(*pos), field); 1; });                                        \
+         pos = n)
+
+#endif  /* QEMU_RBTREE_H */
diff --git a/include/qemu/rbtree_augmented.h b/include/qemu/rbtree_augmented.h
new file mode 100644
index 0000000..f218d31
--- /dev/null
+++ b/include/qemu/rbtree_augmented.h
@@ -0,0 +1,235 @@
+/*
+  Red Black Trees
+  (C) 1999  Andrea Arcangeli <andrea@suse.de>
+  (C) 2002  David Woodhouse <dwmw2@infradead.org>
+  (C) 2012  Michel Lespinasse <walken@google.com>
+
+  This program is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program; if not, write to the Free Software
+  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+  include/qemu/rbtree_augmented.h
+*/
+
+#ifndef QEMU_RBTREE_AUGMENTED_H
+#define QEMU_RBTREE_AUGMENTED_H
+
+#include "qemu/compiler.h"
+#include "qemu/rbtree.h"
+
+/*
+ * Please note - only struct RbAugmentCallbacks and the prototypes for
+ * rb_insert_augmented() and rb_erase_augmented() are intended to be public.
+ * The rest are implementation details you are not expected to depend on.
+ */
+
+struct RbAugmentCallbacks {
+    void (*propagate)(struct RbNode *node, struct RbNode *stop);
+    void (*copy)(struct RbNode *old, struct RbNode *new);
+    void (*rotate)(struct RbNode *old, struct RbNode *new);
+};
+
+extern void __rb_insert_augmented(struct RbNode *node, struct RbRoot *root,
+    void (*augment_rotate)(struct RbNode *old, struct RbNode *new));
+/*
+ * Fixup the rbtree and update the augmented information when rebalancing.
+ *
+ * On insertion, the user must update the augmented information on the path
+ * leading to the inserted node, then call rb_link_node() as usual and
+ * rb_augment_inserted() instead of the usual rb_insert_color() call.
+ * If rb_augment_inserted() rebalances the rbtree, it will callback into
+ * a user provided function to update the augmented information on the
+ * affected subtrees.
+ */
+static inline void
+rb_insert_augmented(struct RbNode *node, struct RbRoot *root,
+                    const struct RbAugmentCallbacks *augment)
+{
+    __rb_insert_augmented(node, root, augment->rotate);
+}
+
+#define RB_DECLARE_CALLBACKS(rbstatic, rbname, rbstruct, rbfield, \
+                             rbtype, rbaugmented, rbcompute)      \
+static inline void                                                \
+rbname ## _propagate(struct RbNode *rb, struct RbNode *stop)      \
+{                                                                 \
+    while (rb != stop) {                                          \
+        rbstruct *node = RB_ENTRY(rb, rbstruct, rbfield);         \
+        rbtype augmented = rbcompute(node);                       \
+        if (node->rbaugmented == augmented) {                     \
+            break;                                                \
+        }                                                         \
+        node->rbaugmented = augmented;                            \
+        rb = RB_PARENT(&node->rbfield);                           \
+    }                                                             \
+}                                                                 \
+static inline void                                                \
+rbname ## _copy(struct RbNode *rb_old, struct RbNode *rb_new)     \
+{                                                                 \
+    rbstruct *old = RB_ENTRY(rb_old, rbstruct, rbfield);          \
+    rbstruct *new = RB_ENTRY(rb_new, rbstruct, rbfield);          \
+    new->rbaugmented = old->rbaugmented;                          \
+}                                                                 \
+static void                                                       \
+rbname ## _rotate(struct RbNode *rb_old, struct RbNode *rb_new)   \
+{                                                                 \
+    rbstruct *old = RB_ENTRY(rb_old, rbstruct, rbfield);          \
+    rbstruct *new = RB_ENTRY(rb_new, rbstruct, rbfield);          \
+    new->rbaugmented = old->rbaugmented;                          \
+    old->rbaugmented = rbcompute(old);                            \
+}                                                                 \
+rbstatic const struct RbAugmentCallbacks rbname = {               \
+    rbname ## _propagate, rbname ## _copy, rbname ## _rotate      \
+};
+
+
+#define RB_RED   0
+#define RB_BLACK 1
+
+#define __RB_PARENT(pc)    ((struct RbNode *)(pc & ~3))
+
+#define __RB_COLOR(pc)     ((pc) & 1)
+#define __RB_IS_BLACK(pc)  __RB_COLOR(pc)
+#define __RB_IS_RED(pc)    (!__RB_COLOR(pc))
+#define RB_COLOR(rb)       __RB_COLOR((rb)->__rb_parent_color)
+#define RB_IS_RED(rb)      __RB_IS_RED((rb)->__rb_parent_color)
+#define RB_IS_BLACK(rb)    __RB_IS_BLACK((rb)->__rb_parent_color)
+
+static inline void rb_set_parent(struct RbNode *rb, struct RbNode *p)
+{
+    rb->__rb_parent_color = RB_COLOR(rb) | (uintptr_t)p;
+}
+
+static inline void rb_set_parent_color(struct RbNode *rb,
+                                       struct RbNode *p, int color)
+{
+    rb->__rb_parent_color = (uintptr_t)p | color;
+}
+
+static inline void
+__rb_change_child(struct RbNode *old, struct RbNode *new,
+                  struct RbNode *parent, struct RbRoot *root)
+{
+    if (parent) {
+        if (parent->rb_left == old) {
+            parent->rb_left = new;
+        } else {
+            parent->rb_right = new;
+        }
+    } else {
+        root->rb_node = new;
+    }
+}
+
+extern void __rb_erase_color(struct RbNode *parent, struct RbRoot *root,
+    void (*augment_rotate)(struct RbNode *old, struct RbNode *new));
+
+static inline struct RbNode *
+__rb_erase_augmented(struct RbNode *node, struct RbRoot *root,
+                     const struct RbAugmentCallbacks *augment)
+{
+    struct RbNode *child = node->rb_right, *tmp = node->rb_left;
+    struct RbNode *parent, *rebalance;
+    uintptr_t pc;
+
+    if (!tmp) {
+        /*
+         * Case 1: node to erase has no more than 1 child (easy!)
+         *
+         * Note that if there is one child it must be red due to 5)
+         * and node must be black due to 4). We adjust colors locally
+         * so as to bypass __rb_erase_color() later on.
+         */
+        pc = node->__rb_parent_color;
+        parent = __RB_PARENT(pc);
+        __rb_change_child(node, child, parent, root);
+        if (child) {
+            child->__rb_parent_color = pc;
+            rebalance = NULL;
+        } else {
+            rebalance = __RB_IS_BLACK(pc) ? parent : NULL;
+        }
+        tmp = parent;
+    } else if (!child) {
+        /* Still case 1, but this time the child is node->rb_left */
+        tmp->__rb_parent_color = pc = node->__rb_parent_color;
+        parent = __RB_PARENT(pc);
+        __rb_change_child(node, tmp, parent, root);
+        rebalance = NULL;
+        tmp = parent;
+    } else {
+        struct RbNode *successor = child, *child2;
+        tmp = child->rb_left;
+        if (!tmp) {
+            /*
+             * Case 2: node's successor is its right child
+             *
+             *    (n)          (s)
+             *    / \          / \
+             *  (x) (s)  ->  (x) (c)
+             *        \
+             *        (c)
+             */
+            parent = successor;
+            child2 = successor->rb_right;
+            augment->copy(node, successor);
+        } else {
+            /*
+             * Case 3: node's successor is leftmost under
+             * node's right child subtree
+             *
+             *    (n)          (s)
+             *    / \          / \
+             *  (x) (y)  ->  (x) (y)
+             *      /            /
+             *    (p)          (p)
+             *    /            /
+             *  (s)          (c)
+             *    \
+             *    (c)
+             */
+            do {
+                parent = successor;
+                successor = tmp;
+                tmp = tmp->rb_left;
+            } while (tmp);
+            parent->rb_left = child2 = successor->rb_right;
+            successor->rb_right = child;
+            rb_set_parent(child, successor);
+            augment->copy(node, successor);
+            augment->propagate(parent, successor);
+        }
+
+        successor->rb_left = tmp = node->rb_left;
+        rb_set_parent(tmp, successor);
+
+        pc = node->__rb_parent_color;
+        tmp = __RB_PARENT(pc);
+        __rb_change_child(node, successor, tmp, root);
+        if (child2) {
+            successor->__rb_parent_color = pc;
+            rb_set_parent_color(child2, parent, RB_BLACK);
+            rebalance = NULL;
+        } else {
+            uintptr_t pc2 = successor->__rb_parent_color;
+            successor->__rb_parent_color = pc;
+            rebalance = __RB_IS_BLACK(pc2) ? parent : NULL;
+        }
+        tmp = successor;
+    }
+
+    augment->propagate(tmp, NULL);
+    return rebalance;
+}
+
+#endif /* QEMU_RBTREE_AUGMENTED_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 36c7dcc..727a567 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -37,3 +37,4 @@ util-obj-y += log.o
 util-obj-y += qdist.o
 util-obj-y += qht.o
 util-obj-y += range.o
+util-obj-y += rbtree.o
diff --git a/util/rbtree.c b/util/rbtree.c
new file mode 100644
index 0000000..704dcea
--- /dev/null
+++ b/util/rbtree.c
@@ -0,0 +1,570 @@
+/*
+  Red Black Trees
+  (C) 1999  Andrea Arcangeli <andrea@suse.de>
+  (C) 2002  David Woodhouse <dwmw2@infradead.org>
+  (C) 2012  Michel Lespinasse <walken@google.com>
+
+  This program is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program; if not, write to the Free Software
+  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+*/
+
+#include "qemu/rbtree_augmented.h"
+
+/*
+ * red-black trees properties:  http://en.wikipedia.org/wiki/Rbtree
+ *
+ *  1) A node is either red or black
+ *  2) The root is black
+ *  3) All leaves (NULL) are black
+ *  4) Both children of every red node are black
+ *  5) Every simple path from root to leaves contains the same number
+ *     of black nodes.
+ *
+ *  4 and 5 give the O(log n) guarantee, since 4 implies you cannot have two
+ *  consecutive red nodes in a path and every red node is therefore followed by
+ *  a black. So if B is the number of black nodes on every simple path (as per
+ *  5), then the longest possible path due to 4 is 2B.
+ *
+ *  We shall indicate color with case, where black nodes are uppercase and red
+ *  nodes will be lowercase. Unknown color nodes shall be drawn as red within
+ *  parentheses and have some accompanying text comment.
+ */
+
+static inline void rb_set_black(struct RbNode *rb)
+{
+    rb->__rb_parent_color |= RB_BLACK;
+}
+
+static inline struct RbNode *rb_red_parent(struct RbNode *red)
+{
+    return (struct RbNode *)red->__rb_parent_color;
+}
+
+/*
+ * Helper function for rotations:
+ * - old's parent and color get assigned to new
+ * - old gets assigned new as a parent and 'color' as a color.
+ */
+static inline void
+__rb_rotate_set_parents(struct RbNode *old, struct RbNode *new,
+                        struct RbRoot *root, int color)
+{
+    struct RbNode *parent = RB_PARENT(old);
+    new->__rb_parent_color = old->__rb_parent_color;
+    rb_set_parent_color(old, new, color);
+    __rb_change_child(old, new, parent, root);
+}
+
+static inline void
+__rb_insert(struct RbNode *node, struct RbRoot *root,
+            void (*augment_rotate)(struct RbNode *old, struct RbNode *new))
+{
+    struct RbNode *parent = rb_red_parent(node), *gparent, *tmp;
+
+    while (true) {
+        /*
+        * Loop invariant: node is red
+        *
+        * If there is a black parent, we are done.
+        * Otherwise, take some corrective action as we don't
+        * want a red root or two consecutive red nodes.
+        */
+        if (!parent) {
+            rb_set_parent_color(node, NULL, RB_BLACK);
+            break;
+        } else if (RB_IS_BLACK(parent)) {
+            break;
+        }
+
+        gparent = rb_red_parent(parent);
+
+        tmp = gparent->rb_right;
+        if (parent != tmp) {    /* parent == gparent->rb_left */
+            if (tmp && RB_IS_RED(tmp)) {
+                /*
+                 * Case 1 - color flips
+                 *
+                 *       G            g
+                 *      / \          / \
+                 *     p   u  -->   P   U
+                 *    /            /
+                 *   n            n
+                 *
+                 * However, since g's parent might be red, and
+                 * 4) does not allow this, we need to recurse
+                 * at g.
+                 */
+                rb_set_parent_color(tmp, gparent, RB_BLACK);
+                rb_set_parent_color(parent, gparent, RB_BLACK);
+                node = gparent;
+                parent = RB_PARENT(node);
+                rb_set_parent_color(node, parent, RB_RED);
+                continue;
+            }
+
+            tmp = parent->rb_right;
+            if (node == tmp) {
+                /*
+                 * Case 2 - left rotate at parent
+                 *
+                 *      G             G
+                 *     / \           / \
+                 *    p   U  -->    n   U
+                 *     \           /
+                 *      n         p
+                 *
+                 * This still leaves us in violation of 4), the
+                 * continuation into Case 3 will fix that.
+                 */
+                parent->rb_right = tmp = node->rb_left;
+                node->rb_left = parent;
+                if (tmp) {
+                    rb_set_parent_color(tmp, parent, RB_BLACK);
+                }
+                rb_set_parent_color(parent, node, RB_RED);
+                augment_rotate(parent, node);
+                parent = node;
+                tmp = node->rb_right;
+            }
+
+            /*
+             * Case 3 - right rotate at gparent
+             *
+             *        G           P
+             *       / \         / \
+             *      p   U  -->  n   g
+             *     /                 \
+             *    n                   U
+             */
+            gparent->rb_left = tmp;  /* == parent->rb_right */
+            parent->rb_right = gparent;
+            if (tmp) {
+                rb_set_parent_color(tmp, gparent, RB_BLACK);
+            }
+            __rb_rotate_set_parents(gparent, parent, root, RB_RED);
+            augment_rotate(gparent, parent);
+            break;
+        } else {
+            tmp = gparent->rb_left;
+            if (tmp && RB_IS_RED(tmp)) {
+                /* Case 1 - color flips */
+                rb_set_parent_color(tmp, gparent, RB_BLACK);
+                rb_set_parent_color(parent, gparent, RB_BLACK);
+                node = gparent;
+                parent = RB_PARENT(node);
+                rb_set_parent_color(node, parent, RB_RED);
+                continue;
+            }
+
+            tmp = parent->rb_left;
+            if (node == tmp) {
+                /* Case 2 - right rotate at parent */
+                parent->rb_left = tmp = node->rb_right;
+                node->rb_right = parent;
+                if (tmp) {
+                    rb_set_parent_color(tmp, parent, RB_BLACK);
+                }
+                rb_set_parent_color(parent, node, RB_RED);
+                augment_rotate(parent, node);
+                parent = node;
+                tmp = node->rb_left;
+            }
+
+            /* Case 3 - left rotate at gparent */
+            gparent->rb_right = tmp;  /* == parent->rb_left */
+            parent->rb_left = gparent;
+            if (tmp) {
+                rb_set_parent_color(tmp, gparent, RB_BLACK);
+            }
+            __rb_rotate_set_parents(gparent, parent, root, RB_RED);
+            augment_rotate(gparent, parent);
+            break;
+        }
+    }
+}
+
+/*
+ * Inline version for rb_erase() use - we want to be able to inline
+ * and eliminate the dummy_rotate callback there
+ */
+static inline void
+____rb_erase_color(struct RbNode *parent, struct RbRoot *root,
+                   void (*augment_rotate)(struct RbNode *old,
+                                          struct RbNode *new))
+{
+    struct RbNode *node = NULL, *sibling, *tmp1, *tmp2;
+
+    while (true) {
+        /*
+         * Loop invariants:
+         * - node is black (or NULL on first iteration)
+         * - node is not the root (parent is not NULL)
+         * - All leaf paths going through parent and node have a
+         *   black node count that is 1 lower than other leaf paths.
+         */
+        sibling = parent->rb_right;
+        if (node != sibling) {    /* node == parent->rb_left */
+            if (RB_IS_RED(sibling)) {
+                /*
+                 * Case 1 - left rotate at parent
+                 *
+                 *     P               S
+                 *    / \             / \
+                 *   N   s    -->    p   Sr
+                 *      / \         / \
+                 *     Sl  Sr      N   Sl
+                 */
+                parent->rb_right = tmp1 = sibling->rb_left;
+                sibling->rb_left = parent;
+                rb_set_parent_color(tmp1, parent, RB_BLACK);
+                __rb_rotate_set_parents(parent, sibling, root,
+                            RB_RED);
+                augment_rotate(parent, sibling);
+                sibling = tmp1;
+            }
+            tmp1 = sibling->rb_right;
+            if (!tmp1 || RB_IS_BLACK(tmp1)) {
+                tmp2 = sibling->rb_left;
+                if (!tmp2 || RB_IS_BLACK(tmp2)) {
+                    /*
+                     * Case 2 - sibling color flip
+                     * (p could be either color here)
+                     *
+                     *    (p)           (p)
+                     *    / \           / \
+                     *   N   S    -->  N   s
+                     *      / \           / \
+                     *     Sl  Sr        Sl  Sr
+                     *
+                     * This leaves us violating 5) which
+                     * can be fixed by flipping p to black
+                     * if it was red, or by recursing at p.
+                     * p is red when coming from Case 1.
+                     */
+                    rb_set_parent_color(sibling, parent,
+                                RB_RED);
+                    if (RB_IS_RED(parent)) {
+                        rb_set_black(parent);
+                    } else {
+                        node = parent;
+                        parent = RB_PARENT(node);
+                        if (parent) {
+                            continue;
+                        }
+                    }
+                    break;
+                }
+                /*
+                 * Case 3 - right rotate at sibling
+                 * (p could be either color here)
+                 *
+                 *   (p)           (p)
+                 *   / \           / \
+                 *  N   S    -->  N   Sl
+                 *     / \             \
+                 *    sl  Sr            s
+                 *                       \
+                 *                        Sr
+                 */
+                sibling->rb_left = tmp1 = tmp2->rb_right;
+                tmp2->rb_right = sibling;
+                parent->rb_right = tmp2;
+                if (tmp1) {
+                    rb_set_parent_color(tmp1, sibling, RB_BLACK);
+                }
+                augment_rotate(sibling, tmp2);
+                tmp1 = sibling;
+                sibling = tmp2;
+            }
+            /*
+             * Case 4 - left rotate at parent + color flips
+             * (p and sl could be either color here.
+             *  After rotation, p becomes black, s acquires
+             *  p's color, and sl keeps its color)
+             *
+             *      (p)             (s)
+             *      / \             / \
+             *     N   S     -->   P   Sr
+             *        / \         / \
+             *      (sl) sr      N  (sl)
+             */
+            parent->rb_right = tmp2 = sibling->rb_left;
+            sibling->rb_left = parent;
+            rb_set_parent_color(tmp1, sibling, RB_BLACK);
+            if (tmp2) {
+                rb_set_parent(tmp2, parent);
+            }
+            __rb_rotate_set_parents(parent, sibling, root,
+                        RB_BLACK);
+            augment_rotate(parent, sibling);
+            break;
+        } else {
+            sibling = parent->rb_left;
+            if (RB_IS_RED(sibling)) {
+                /* Case 1 - right rotate at parent */
+                parent->rb_left = tmp1 = sibling->rb_right;
+                sibling->rb_right = parent;
+                rb_set_parent_color(tmp1, parent, RB_BLACK);
+                __rb_rotate_set_parents(parent, sibling, root,
+                            RB_RED);
+                augment_rotate(parent, sibling);
+                sibling = tmp1;
+            }
+            tmp1 = sibling->rb_left;
+            if (!tmp1 || RB_IS_BLACK(tmp1)) {
+                tmp2 = sibling->rb_right;
+                if (!tmp2 || RB_IS_BLACK(tmp2)) {
+                    /* Case 2 - sibling color flip */
+                    rb_set_parent_color(sibling, parent,
+                                RB_RED);
+                    if (RB_IS_RED(parent)) {
+                        rb_set_black(parent);
+                    } else {
+                        node = parent;
+                        parent = RB_PARENT(node);
+                        if (parent) {
+                            continue;
+                        }
+                    }
+                    break;
+                }
+                /* Case 3 - right rotate at sibling */
+                sibling->rb_right = tmp1 = tmp2->rb_left;
+                tmp2->rb_left = sibling;
+                parent->rb_left = tmp2;
+                if (tmp1) {
+                    rb_set_parent_color(tmp1, sibling,
+                                RB_BLACK);
+                }
+                augment_rotate(sibling, tmp2);
+                tmp1 = sibling;
+                sibling = tmp2;
+            }
+            /* Case 4 - left rotate at parent + color flips */
+            parent->rb_left = tmp2 = sibling->rb_right;
+            sibling->rb_right = parent;
+            rb_set_parent_color(tmp1, sibling, RB_BLACK);
+            if (tmp2) {
+                rb_set_parent(tmp2, parent);
+            }
+            __rb_rotate_set_parents(parent, sibling, root,
+                        RB_BLACK);
+            augment_rotate(parent, sibling);
+            break;
+        }
+    }
+}
+
+/* Non-inline version for rb_erase_augmented() use */
+void __rb_erase_color(struct RbNode *parent, struct RbRoot *root,
+                      void (*augment_rotate)(struct RbNode *old,
+                                             struct RbNode *new))
+{
+    ____rb_erase_color(parent, root, augment_rotate);
+}
+
+/*
+ * Non-augmented rbtree manipulation functions.
+ *
+ * We use dummy augmented callbacks here, and have the compiler optimize them
+ * out of the rb_insert_color() and rb_erase() function definitions.
+ */
+
+static inline void dummy_propagate(struct RbNode *node, struct RbNode *stop) {}
+static inline void dummy_copy(struct RbNode *old, struct RbNode *new) {}
+static inline void dummy_rotate(struct RbNode *old, struct RbNode *new) {}
+
+static const struct RbAugmentCallbacks dummy_callbacks = {
+    dummy_propagate, dummy_copy, dummy_rotate
+};
+
+void rb_insert_color(struct RbNode *node, struct RbRoot *root)
+{
+    __rb_insert(node, root, dummy_rotate);
+}
+
+void rb_erase(struct RbNode *node, struct RbRoot *root)
+{
+    struct RbNode *rebalance;
+    rebalance = __rb_erase_augmented(node, root, &dummy_callbacks);
+    if (rebalance) {
+        ____rb_erase_color(rebalance, root, dummy_rotate);
+    }
+}
+
+/*
+ * Augmented rbtree manipulation functions.
+ *
+ * This instantiates the same __always_inline functions as in the non-augmented
+ * case, but this time with user-defined callbacks.
+ */
+
+void __rb_insert_augmented(struct RbNode *node, struct RbRoot *root,
+                           void (*augment_rotate)(struct RbNode *old,
+                                                  struct RbNode *new))
+{
+    __rb_insert(node, root, augment_rotate);
+}
+
+/*
+ * This function returns the first node (in sort order) of the tree.
+ */
+struct RbNode *rb_first(const struct RbRoot *root)
+{
+    struct RbNode    *n;
+
+    n = root->rb_node;
+    if (!n) {
+        return NULL;
+    }
+    while (n->rb_left) {
+        n = n->rb_left;
+    }
+    return n;
+}
+
+struct RbNode *rb_last(const struct RbRoot *root)
+{
+    struct RbNode    *n;
+
+    n = root->rb_node;
+    if (!n) {
+        return NULL;
+    }
+    while (n->rb_right) {
+        n = n->rb_right;
+    }
+    return n;
+}
+
+struct RbNode *rb_next(const struct RbNode *node)
+{
+    struct RbNode *parent;
+
+    if (RB_EMPTY_NODE(node)) {
+        return NULL;
+    }
+
+    /*
+     * If we have a right-hand child, go down and then left as far
+     * as we can.
+     */
+    if (node->rb_right) {
+        node = node->rb_right;
+        while (node->rb_left) {
+            node = node->rb_left;
+        }
+        return (struct RbNode *)node;
+    }
+
+    /*
+     * No right-hand children. Everything down and left is smaller than us,
+     * so any 'next' node must be in the general direction of our parent.
+     * Go up the tree; any time the ancestor is a right-hand child of its
+     * parent, keep going up. First time it's a left-hand child of its
+     * parent, said parent is our 'next' node.
+     */
+    while ((parent = RB_PARENT(node)) && node == parent->rb_right) {
+        node = parent;
+    }
+    return parent;
+}
+
+struct RbNode *rb_prev(const struct RbNode *node)
+{
+    struct RbNode *parent;
+
+    if (RB_EMPTY_NODE(node)) {
+        return NULL;
+    }
+
+    /*
+     * If we have a left-hand child, go down and then right as far
+     * as we can.
+     */
+    if (node->rb_left) {
+        node = node->rb_left;
+        while (node->rb_right) {
+            node = node->rb_right;
+        }
+        return (struct RbNode *)node;
+    }
+
+    /*
+     * No left-hand children. Go up till we find an ancestor which
+     * is a right-hand child of its parent.
+     */
+    while ((parent = RB_PARENT(node)) && node == parent->rb_left) {
+        node = parent;
+    }
+    return parent;
+}
+
+void rb_replace_node(struct RbNode *victim, struct RbNode *new,
+             struct RbRoot *root)
+{
+    struct RbNode *parent = RB_PARENT(victim);
+
+    /* Set the surrounding nodes to point to the replacement */
+    __rb_change_child(victim, new, parent, root);
+    if (victim->rb_left) {
+        rb_set_parent(victim->rb_left, new);
+    }
+    if (victim->rb_right) {
+        rb_set_parent(victim->rb_right, new);
+    }
+    /* Copy the pointers/colour from the victim to the replacement */
+    *new = *victim;
+}
+
+static struct RbNode *rb_left_deepest_node(const struct RbNode *node)
+{
+    for (;;) {
+        if (node->rb_left) {
+            node = node->rb_left;
+        } else if (node->rb_right) {
+            node = node->rb_right;
+        } else {
+            return (struct RbNode *)node;
+        }
+    }
+}
+
+struct RbNode *rb_next_postorder(const struct RbNode *node)
+{
+    const struct RbNode *parent;
+    if (!node) {
+        return NULL;
+    }
+    parent = RB_PARENT(node);
+
+    /* If we're sitting on node, we've already seen our children */
+    if (parent && node == parent->rb_left && parent->rb_right) {
+        /* If we are the parent's left node, go to the parent's right
+         * node then all the way down to the left */
+        return rb_left_deepest_node(parent->rb_right);
+    } else {
+        /* Otherwise we are the parent's right node, and the parent
+         * should be next */
+        return (struct RbNode *)parent;
+    }
+}
+
+struct RbNode *rb_first_postorder(const struct RbRoot *root)
+{
+    if (!root->rb_node) {
+        return NULL;
+    }
+    return rb_left_deepest_node(root->rb_node);
+}
-- 
2.10.1

* [Qemu-devel] [PATCH v1 04/18] util/rbcache: range-based cache core
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (2 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 03/18] util/rbtree: add rbtree from linux kernel Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-23 21:25   ` Kevin Wolf
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 05/18] tests/test-rbcache: add test cases Pavel Butsykin
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

RBCache provides functionality to cache data from block devices (basically).
A range is used as the main key for searching and storing data. The cache is
based on red-black trees, so the basic operations (search, insert, delete) are
performed in O(log n).

It is important to note that QEMU usually does not require a data cache, but
in practice there are already some cases where a small cache can increase
performance. Red-black trees were selected as the data structure because they
are fairly simple and show high efficiency on a small number of elements.
Therefore, with a minimum range of 512 bytes, the recommended size of the
cache memory is no more than 8-16MB. Also note that this cache implementation
allows ranges of different lengths to be stored without alignment.

The generic cache core can easily be used to implement different caching
policies at the block level, such as read-ahead. It can also be used in some
special cases, for example for caching data in qcow2 during sequential
allocating writes to an image with a backing file.
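
A hypothetical usage sketch (the size and policy are illustrative, and leaving
the alloc/free callbacks NULL is assumed to select the default node
allocator):

static void rbcache_example(void)
{
    /* a 16MB LRU-bounded cache */
    RBCache *cache = rbcache_create(NULL, NULL, 16 << 20, RBCACHE_LRU, NULL);
    RBCacheNode *node;

    /* insert the range [4096, 4096 + 512); if a node covering this
     * range already exists, that node is returned instead */
    node = rbcache_search_and_insert(cache, 4096, 512);

    /* a later lookup for the same range finds the node */
    assert(rbcache_search(cache, 4096, 512) == node);

    rbcache_remove(cache, node);
    rbcache_destroy(cache);
}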

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 MAINTAINERS            |   6 ++
 include/qemu/rbcache.h | 105 +++++++++++++++++++++
 util/Makefile.objs     |   1 +
 util/rbcache.c         | 246 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 358 insertions(+)
 create mode 100644 include/qemu/rbcache.h
 create mode 100644 util/rbcache.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ddf797b..cb74802 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1365,6 +1365,12 @@ F: include/qemu/rbtree.h
 F: include/qemu/rbtree_augmented.h
 F: util/rbtree.c
 
+Range-Based Cache
+M: Denis V. Lunev <den@openvz.org>
+S: Supported
+F: include/qemu/rbcache.h
+F: util/rbcache.c
+
 UUID
 M: Fam Zheng <famz@redhat.com>
 S: Supported
diff --git a/include/qemu/rbcache.h b/include/qemu/rbcache.h
new file mode 100644
index 0000000..c8f0a9f
--- /dev/null
+++ b/include/qemu/rbcache.h
@@ -0,0 +1,105 @@
+/*
+ * QEMU Range-Based Cache core
+ *
+ * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
+ *
+ * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#ifndef RBCACHE_H
+#define RBCACHE_H
+
+#include "qemu/rbtree.h"
+#include "qemu/queue.h"
+
+typedef struct RBCacheNode {
+    struct RbNode rb_node;
+    uint64_t offset;
+    uint64_t bytes;
+    QTAILQ_ENTRY(RBCacheNode) entry;
+} RBCacheNode;
+
+typedef struct RBCache RBCache;
+
+typedef RBCacheNode *RBNodeAlloc(uint64_t offset, uint64_t bytes, void *opaque);
+typedef void RBNodeFree(RBCacheNode *node, void *opaque);
+
+enum eviction_type {
+    RBCACHE_FIFO,
+    RBCACHE_LRU,
+};
+
+/**
+ * rbcache_search:
+ * @rbcache: the cache object.
+ * @offset: the start of the range.
+ * @bytes: the size of the range.
+ *
+ * Returns a node overlapping the range (offset, bytes),
+ * or NULL if no such node was found.
+ */
+void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes);
+
+/**
+ * rbcache_insert:
+ * @rbcache: the cache object.
+ * @node: a new node for the cache.
+ *
+ * Returns the new node, or the existing node if one already overlaps.
+ */
+void *rbcache_insert(RBCache *rbcache, RBCacheNode *node);
+
+/**
+ * rbcache_search_and_insert:
+ * @rbcache: the cache object.
+ * @offset: the start of the range.
+ * @bytes: the size of the range.
+ *
+ * rbcache_search_and_insert() is like rbcache_insert(), except that a new
+ * node is allocated inside the function. Returns the new node, or the
+ * existing node if one already overlaps the range.
+ */
+void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
+                                uint64_t bytes);
+
+/**
+ * rbcache_remove:
+ * @rbcache: the cache object.
+ * @node: the node to remove.
+ *
+ * Removes the cached range owned by the node.
+ */
+void rbcache_remove(RBCache *rbcache, RBCacheNode *node);
+
+RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
+                                uint64_t bytes);
+
+void rbcache_node_free(RBCache *rbcache, RBCacheNode *node);
+
+/**
+ * rbcache_create:
+ * @alloc: callback for node allocation; lets the caller embed RBCacheNode
+ *         in a larger structure with extra payload.
+ * @free: callback to release a node; must be provided together with @alloc.
+ * @limit_size: maximum cache size in bytes.
+ * @eviction_type: eviction policy applied when the cache exceeds @limit_size.
+ * @opaque: opaque pointer passed to the callbacks.
+ *
+ * Returns the cache object.
+ */
+RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
+                        uint64_t limit_size, int eviction_type, void *opaque);
+
+/**
+ * rbcache_destroy:
+ * @rbcache: the cache object.
+ *
+ * Cleanup the cache object created with rbcache_create().
+ */
+void rbcache_destroy(RBCache *rbcache);
+
+#endif /* RBCACHE_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 727a567..9fb0de6 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -38,3 +38,4 @@ util-obj-y += qdist.o
 util-obj-y += qht.o
 util-obj-y += range.o
 util-obj-y += rbtree.o
+util-obj-y += rbcache.o
diff --git a/util/rbcache.c b/util/rbcache.c
new file mode 100644
index 0000000..05cfa5a
--- /dev/null
+++ b/util/rbcache.c
@@ -0,0 +1,246 @@
+/*
+ * QEMU Range-Based Cache core
+ *
+ * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
+ *
+ * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/rbcache.h"
+
+/* RBCache provides functionality to cache data from block devices. A byte
+ * range serves as the key for searching and storing entries. The cache is
+ * built on red-black trees, so the basic operations (search, insert,
+ * delete) run in O(log n).
+ *
+ * It is worth noting that QEMU usually does not need a data cache, but
+ * there are already cases where caching small amounts of data improves
+ * performance. Red-black trees were chosen as the data structure because
+ * they are fairly simple yet efficient for a small number of elements.
+ * Accordingly, with a minimum range of 512 bytes, the recommended cache
+ * size is no more than 8-16 MB. Also note that this cache implementation
+ * can store ranges of different lengths, without alignment.
+ */
+
+struct RBCache {
+    struct RbRoot root;
+    RBNodeAlloc *alloc;
+    RBNodeFree  *free;
+    uint64_t limit_size;
+    uint64_t cur_size;
+    int eviction_type;
+    void *opaque;
+
+    QTAILQ_HEAD(RBCacheNodeHead, RBCacheNode) queue;
+};
+
+static int node_key_cmp(const RBCacheNode *node1, const RBCacheNode *node2)
+{
+    assert(node1 != NULL);
+    assert(node2 != NULL);
+
+    if (node1->offset >= node2->offset + node2->bytes) {
+        return 1;
+    }
+    if (node1->offset + node1->bytes <= node2->offset) {
+        return -1;
+    }
+
+    return 0;
+}
+
+static RBCacheNode *node_previous(RBCacheNode *node, uint64_t target_offset)
+{
+    while (node) {
+        struct RbNode *prev_rb_node = rb_prev(&node->rb_node);
+        RBCacheNode *prev_node;
+        if (prev_rb_node == NULL) {
+            break;
+        }
+        prev_node = container_of(prev_rb_node, RBCacheNode, rb_node);
+        if (prev_node->offset + prev_node->bytes <= target_offset) {
+            break;
+        }
+        node = prev_node;
+    }
+
+    assert(node != NULL);
+
+    return node;
+}
+
+RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
+                                uint64_t bytes)
+{
+    RBCacheNode *node;
+
+    if (rbcache->alloc) {
+        node = rbcache->alloc(offset, bytes, rbcache->opaque);
+    } else {
+        node = g_malloc(sizeof(*node));
+    }
+
+    node->offset = offset;
+    node->bytes  = bytes;
+
+    return node;
+}
+
+void rbcache_node_free(RBCache *rbcache, RBCacheNode *node)
+{
+    if (rbcache->free) {
+        rbcache->free(node, rbcache->opaque);
+    } else {
+        g_free(node);
+    }
+}
+
+static void rbcache_try_shrink(RBCache *rbcache)
+{
+    while (rbcache->cur_size > rbcache->limit_size) {
+        RBCacheNode *node;
+        assert(!QTAILQ_EMPTY(&rbcache->queue));
+
+        node = QTAILQ_LAST(&rbcache->queue, RBCacheNodeHead);
+
+        rbcache_remove(rbcache, node);
+    }
+}
+
+static inline void node_move_in_queue(RBCache *rbcache, RBCacheNode *node)
+{
+    if (rbcache->eviction_type == RBCACHE_LRU) {
+        QTAILQ_REMOVE(&rbcache->queue, node, entry);
+        QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
+    }
+}
+
+static RBCacheNode *node_insert(RBCache *rbcache, RBCacheNode *node, bool alloc)
+{
+    struct RbNode **new, *parent = NULL;
+
+    assert(rbcache != NULL);
+    assert(node->bytes != 0);
+
+    /* Figure out where to put new node */
+    new = &(rbcache->root.rb_node);
+    while (*new) {
+        RBCacheNode *this = container_of(*new, RBCacheNode, rb_node);
+        int result = node_key_cmp(node, this);
+        if (result == 0) {
+            this = node_previous(this, node->offset);
+            node_move_in_queue(rbcache, this);
+            return this;
+        }
+        parent = *new;
+        new = result < 0 ? &((*new)->rb_left) : &((*new)->rb_right);
+    }
+
+    if (alloc) {
+        node = rbcache_node_alloc(rbcache, node->offset, node->bytes);
+    }
+    /* Add new node and rebalance tree. */
+    rb_link_node(&node->rb_node, parent, new);
+    rb_insert_color(&node->rb_node, &rbcache->root);
+
+    rbcache->cur_size += node->bytes;
+
+    rbcache_try_shrink(rbcache);
+
+    QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
+
+    return node;
+}
+
+void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes)
+{
+    struct RbNode *rb_node;
+    RBCacheNode node = {
+        .offset = offset,
+        .bytes  = bytes,
+    };
+
+    assert(rbcache != NULL);
+
+    rb_node = rbcache->root.rb_node;
+    while (rb_node) {
+        RBCacheNode *this = container_of(rb_node, RBCacheNode, rb_node);
+        int result = node_key_cmp(&node, this);
+        if (result == 0) {
+            this = node_previous(this, offset);
+            node_move_in_queue(rbcache, this);
+            return this;
+        }
+        rb_node = result < 0 ? rb_node->rb_left : rb_node->rb_right;
+    }
+    return NULL;
+}
+
+void *rbcache_insert(RBCache *rbcache, RBCacheNode *node)
+{
+    return node_insert(rbcache, node, false);
+}
+
+void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
+                                uint64_t bytes)
+{
+    RBCacheNode node = {
+        .offset = offset,
+        .bytes  = bytes,
+    };
+
+    return node_insert(rbcache, &node, true);
+}
+
+void rbcache_remove(RBCache *rbcache, RBCacheNode *node)
+{
+    assert(rbcache->cur_size >= node->bytes);
+
+    rbcache->cur_size -= node->bytes;
+    rb_erase(&node->rb_node, &rbcache->root);
+
+    QTAILQ_REMOVE(&rbcache->queue, node, entry);
+
+    rbcache_node_free(rbcache, node);
+}
+
+RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
+                        uint64_t limit_size, int eviction_type, void *opaque)
+{
+    RBCache *rbcache = g_malloc(sizeof(*rbcache));
+
+    /* Either both callbacks are provided, or neither of them */
+    assert(!(!alloc ^ !free));
+
+    rbcache->root   = RB_ROOT;
+    rbcache->alloc  = alloc;
+    rbcache->free   = free;
+    rbcache->opaque = opaque;
+    rbcache->limit_size = limit_size;
+    rbcache->cur_size   = 0;
+    rbcache->eviction_type = eviction_type;
+
+    QTAILQ_INIT(&rbcache->queue);
+
+    return rbcache;
+}
+
+void rbcache_destroy(RBCache *rbcache)
+{
+    RBCacheNode *node, *next;
+
+    assert(rbcache != NULL);
+
+    QTAILQ_FOREACH_SAFE(node, &rbcache->queue, entry, next) {
+        QTAILQ_REMOVE(&rbcache->queue, node, entry);
+        rbcache_node_free(rbcache, node);
+    }
+
+    g_free(rbcache);
+}
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 05/18] tests/test-rbcache: add test cases
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (3 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 04/18] util/rbcache: range-based cache core Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-24 12:20   ` Kevin Wolf
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 06/18] block/pcache: statistics collection read requests Pavel Butsykin
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 tests/Makefile.include |   3 +
 tests/test-rbcache.c   | 308 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 311 insertions(+)
 create mode 100644 tests/test-rbcache.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 8162f6f..36bb472 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -54,6 +54,8 @@ check-unit-y += tests/test-hbitmap$(EXESUF)
 gcov-files-test-hbitmap-y = blockjob.c
 check-unit-y += tests/test-blockjob$(EXESUF)
 check-unit-y += tests/test-blockjob-txn$(EXESUF)
+gcov-files-test-rbcache-y = util/rbcache.c
+check-unit-y += tests/test-rbcache$(EXESUF)
 check-unit-y += tests/test-x86-cpuid$(EXESUF)
 # all code tested by test-x86-cpuid is inside topology.h
 gcov-files-test-x86-cpuid-y =
@@ -481,6 +483,7 @@ tests/test-blockjob-txn$(EXESUF): tests/test-blockjob-txn.o $(test-block-obj-y)
 tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
 tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
 tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y)
+tests/test-rbcache$(EXESUF): tests/test-rbcache.o $(test-util-obj-y)
 tests/test-x86-cpuid$(EXESUF): tests/test-x86-cpuid.o
 tests/test-xbzrle$(EXESUF): tests/test-xbzrle.o migration/xbzrle.o page_cache.o $(test-util-obj-y)
 tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o
diff --git a/tests/test-rbcache.c b/tests/test-rbcache.c
new file mode 100644
index 0000000..1c72bfa
--- /dev/null
+++ b/tests/test-rbcache.c
@@ -0,0 +1,308 @@
+/*
+ * QEMU Range-Based Cache core
+ *
+ * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
+ *
+ * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/rbcache.h"
+
+typedef struct TestRBCacheData {
+    RBCache *cache;
+} TestRBCacheData;
+
+typedef struct TestRBCacheConfig {
+    uint64_t limit_size;
+    int eviction_type;
+    RBNodeAlloc *alloc;
+    RBNodeFree  *free;
+    void *opaque;
+} TestRBCacheConfig;
+
+#define KB(_n) ((_n) << 10)
+#define MB(_n) ((_n) << 20)
+
+#define OFFSET1 0
+#define SIZE1   KB(1)
+
+#define OFFSET2 KB(1)
+#define SIZE2   KB(2)
+
+#define OFFSET3 KB(15)
+#define SIZE3   KB(1)
+
+#define OFFSET4 KB(7)
+#define SIZE4   KB(7)
+
+#define OFFSET5 KB(2)
+#define SIZE5   KB(8)
+
+
+static void test_rbcache_init(TestRBCacheData *data, const void *ctx)
+{
+    g_assert_nonnull(data->cache);
+}
+
+static void test_rbcache_insert(TestRBCacheData *data, const void *ctx)
+{
+    RBCacheNode *node1 = rbcache_node_alloc(data->cache, OFFSET1, SIZE1);
+    RBCacheNode *node2 = rbcache_node_alloc(data->cache, OFFSET2, SIZE2);
+    RBCacheNode *node3 = rbcache_node_alloc(data->cache, OFFSET3, SIZE3);
+    RBCacheNode *node4 = rbcache_node_alloc(data->cache, OFFSET4, SIZE4);
+    RBCacheNode *node5 = rbcache_node_alloc(data->cache, OFFSET5, SIZE5);
+    RBCacheNode *node;
+
+    node = rbcache_insert(data->cache, node1);
+    g_assert_true(node == node1);
+
+    node = rbcache_insert(data->cache, node2);
+    g_assert_true(node == node2);
+
+    node = rbcache_insert(data->cache, node3);
+    g_assert_true(node == node3);
+
+    node = rbcache_insert(data->cache, node4);
+    g_assert_true(node == node4);
+
+    node = rbcache_insert(data->cache, node5);
+    g_assert_true(node != node5);
+
+    rbcache_node_free(data->cache, node5);
+}
+
+static void test_rbcache_search(TestRBCacheData *data, const void *ctx)
+{
+    RBCacheNode *node;
+
+    test_rbcache_insert(data, ctx);
+
+    node = rbcache_search(data->cache, OFFSET1, SIZE1);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET1);
+    g_assert_cmpuint(node->bytes, ==, SIZE1);
+
+    node = rbcache_search(data->cache, OFFSET2 + KB(1), SIZE1);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET2);
+    g_assert_cmpuint(node->bytes, ==, SIZE2);
+
+    node = rbcache_search(data->cache, OFFSET5, SIZE5);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET2);
+    g_assert_cmpuint(node->bytes, ==, SIZE2);
+
+    node = rbcache_search(data->cache, OFFSET5 + KB(2), SIZE5);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET4);
+    g_assert_cmpuint(node->bytes, ==, SIZE4);
+
+    node = rbcache_search(data->cache, OFFSET3 + SIZE3, SIZE3);
+    g_assert_null(node);
+}
+
+static void test_rbcache_search_and_insert(TestRBCacheData *data,
+                                           const void *ctx)
+{
+    RBCacheNode *node;
+
+    node = rbcache_search_and_insert(data->cache, OFFSET1, SIZE1);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET1);
+    g_assert_cmpuint(node->bytes, ==, SIZE1);
+
+    node = rbcache_search_and_insert(data->cache, OFFSET2, SIZE2);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET2);
+    g_assert_cmpuint(node->bytes, ==, SIZE2);
+
+    node = rbcache_search_and_insert(data->cache, OFFSET3, SIZE3);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET3);
+    g_assert_cmpuint(node->bytes, ==, SIZE3);
+
+    node = rbcache_search_and_insert(data->cache, OFFSET4, SIZE4);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET4);
+    g_assert_cmpuint(node->bytes, ==, SIZE4);
+
+    node = rbcache_search_and_insert(data->cache, OFFSET5, SIZE5);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET2);
+    g_assert_cmpuint(node->bytes, ==, SIZE2);
+}
+
+static void test_rbcache_remove(TestRBCacheData *data, const void *ctx)
+{
+    RBCacheNode *node;
+
+    test_rbcache_search_and_insert(data, ctx);
+
+    node = rbcache_search(data->cache, OFFSET1, SIZE1);
+    g_assert_nonnull(node);
+    rbcache_remove(data->cache, node);
+    node = rbcache_search(data->cache, OFFSET1, SIZE1);
+    g_assert_null(node);
+
+    node = rbcache_search(data->cache, OFFSET3, SIZE3);
+    g_assert_nonnull(node);
+    rbcache_remove(data->cache, node);
+    node = rbcache_search(data->cache, OFFSET3, SIZE3);
+    g_assert_null(node);
+
+    node = rbcache_search(data->cache, OFFSET4, SIZE4);
+    g_assert_nonnull(node);
+    rbcache_remove(data->cache, node);
+    node = rbcache_search(data->cache, OFFSET4, SIZE4);
+    g_assert_null(node);
+
+    node = rbcache_search(data->cache, OFFSET2, SIZE2);
+    g_assert_nonnull(node);
+    rbcache_remove(data->cache, node);
+    node = rbcache_search(data->cache, OFFSET2, SIZE2);
+    g_assert_null(node);
+}
+
+static void test_rbcache_shrink(TestRBCacheData *data, const void *ctx)
+{
+    RBCacheNode *node;
+
+    node = rbcache_search_and_insert(data->cache, 0, MB(2));
+    g_assert_nonnull(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(2), MB(3));
+    g_assert_nonnull(node);
+
+    node = rbcache_search(data->cache, 0, MB(2));
+    g_assert_null(node);
+
+    node = rbcache_search(data->cache, MB(2), MB(3));
+    g_assert_nonnull(node);
+
+    node = rbcache_search_and_insert(data->cache, 0, MB(2));
+    g_assert_nonnull(node);
+
+    node = rbcache_search(data->cache, 0, MB(2));
+    g_assert_nonnull(node);
+
+    node = rbcache_search(data->cache, MB(2), MB(3));
+    g_assert_null(node);
+}
+
+static void test_rbcache_shrink_fifo(TestRBCacheData *data, const void *ctx)
+{
+    RBCacheNode *node;
+
+    rbcache_search_and_insert(data->cache, 0, MB(1));
+    rbcache_search_and_insert(data->cache, MB(1), MB(1));
+    rbcache_search_and_insert(data->cache, MB(2), MB(1));
+    rbcache_search_and_insert(data->cache, MB(3), MB(1));
+
+    node = rbcache_search_and_insert(data->cache, MB(4), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, 0, MB(1));
+    g_assert_null(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(5), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(1), MB(1));
+    g_assert_null(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(6), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(2), MB(1));
+    g_assert_null(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(7), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(3), MB(1));
+    g_assert_null(node);
+}
+
+static void test_rbcache_shrink_lru(TestRBCacheData *data, const void *ctx)
+{
+    RBCacheNode *node;
+
+    rbcache_search_and_insert(data->cache, 0, MB(1));
+    rbcache_search_and_insert(data->cache, MB(1), MB(1));
+    rbcache_search_and_insert(data->cache, MB(2), MB(1));
+    rbcache_search_and_insert(data->cache, MB(3), MB(1));
+
+    node = rbcache_search(data->cache, MB(2), MB(1));
+    g_assert_nonnull(node);
+
+    node = rbcache_search(data->cache, MB(1), MB(1));
+    g_assert_nonnull(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(4), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, 0, MB(1));
+    g_assert_null(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(5), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(3), MB(1));
+    g_assert_null(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(6), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(2), MB(1));
+    g_assert_null(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(7), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(1), MB(1));
+    g_assert_null(node);
+}
+
+static void test_rbcache_setup(TestRBCacheData *data, const void *ctx)
+{
+    const TestRBCacheConfig *config = ctx;
+
+    data->cache =
+        rbcache_create(config->alloc, config->free, config->limit_size,
+                       config->eviction_type, config->opaque);
+}
+
+static void test_rbcache_teardown(TestRBCacheData *data, const void *ctx)
+{
+    rbcache_destroy(data->cache);
+}
+
+static void rbcache_test_add(const char *testpath,
+                             void (*test_func)(TestRBCacheData *data,
+                             const void *user_data), void *ctx)
+{
+    g_test_add(testpath, TestRBCacheData, ctx, test_rbcache_setup, test_func,
+               test_rbcache_teardown);
+}
+
+int main(int argc, char **argv)
+{
+    TestRBCacheConfig config = {
+        .limit_size = MB(4),
+        .eviction_type = RBCACHE_FIFO,
+    };
+
+    g_test_init(&argc, &argv, NULL);
+
+    rbcache_test_add("/rbcache/init", test_rbcache_init, &config);
+    rbcache_test_add("/rbcache/insert", test_rbcache_insert, &config);
+    rbcache_test_add("/rbcache/search", test_rbcache_search, &config);
+    rbcache_test_add("/rbcache/search_and_insert",
+                     test_rbcache_search_and_insert, &config);
+    rbcache_test_add("/rbcache/rbcache_remove", test_rbcache_remove, &config);
+    rbcache_test_add("/rbcache/shrink", test_rbcache_shrink, &config);
+    rbcache_test_add("/rbcache/shrink/fifo", test_rbcache_shrink_fifo, &config);
+
+    config.eviction_type = RBCACHE_LRU;
+    rbcache_test_add("/rbcache/shrink/lru", test_rbcache_shrink_lru, &config);
+
+    g_test_run();
+
+    return 0;
+}
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 06/18] block/pcache: statistics collection read requests
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (4 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 05/18] tests/test-rbcache: add test cases Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 07/18] block/pcache: skip large aio read Pavel Butsykin
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

Here the rbcache is used as a repository of request statistics; the request
data itself is not cached, which makes it possible to keep track of a large
number of requests. The statistics are needed to detect sequential reads.
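
For illustration (hypothetical numbers): the statistics are nothing but
bare ranges in an rbcache with FIFO eviction, as set up in
pcache_state_init() below:

    RBCache *stats = rbcache_create(NULL, NULL, 3 << 20, RBCACHE_FIFO, NULL);

    /* each filtered read leaves only its range behind, no data */
    rbcache_search_and_insert(stats, 0, 4096);
    rbcache_search_and_insert(stats, 4096, 4096);

    /* once the recorded ranges sum to more than the 3 MB limit,
     * the oldest ranges are evicted first (RBCACHE_FIFO) */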

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/block/pcache.c b/block/pcache.c
index 59461df..60b1f93 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -13,7 +13,9 @@
 #include "block/block_int.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qstring.h"
+#include "qemu/rbcache.h"
 
+#define PCACHE_OPT_STATS_SIZE "pcache-stats-size"
 
 static QemuOptsList runtime_opts = {
     .name = "pcache",
@@ -24,10 +26,22 @@ static QemuOptsList runtime_opts = {
             .type = QEMU_OPT_STRING,
             .help = "[internal use only, will be removed]",
         },
+        {
+            .name = PCACHE_OPT_STATS_SIZE,
+            .type = QEMU_OPT_SIZE,
+            .help = "Total volume of requests for statistics",
+        },
         { /* end of list */ }
     },
 };
 
+#define MB_BITS 20
+#define PCACHE_DEFAULT_STATS_SIZE (3 << MB_BITS)
+
+typedef struct BDRVPCacheState {
+    RBCache *req_stats;
+} BDRVPCacheState;
+
 typedef struct PCacheAIOCB {
     Coroutine *co;
     int ret;
@@ -45,10 +59,13 @@ static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
 {
+    BDRVPCacheState *s = bs->opaque;
     PCacheAIOCB acb = {
         .co = qemu_coroutine_self(),
     };
 
+    rbcache_search_and_insert(s->req_stats, offset, bytes);
+
     bdrv_aio_preadv(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
 
     qemu_coroutine_yield();
@@ -71,6 +88,13 @@ static coroutine_fn int pcache_co_pwritev(BlockDriverState *bs, uint64_t offset,
     return acb.ret;
 }
 
+static void pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
+{
+    uint64_t stats_size = qemu_opt_get_size(opts, PCACHE_OPT_STATS_SIZE,
+                                            PCACHE_DEFAULT_STATS_SIZE);
+    s->req_stats = rbcache_create(NULL, NULL, stats_size, RBCACHE_FIFO, s);
+}
+
 static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
                             Error **errp)
 {
@@ -92,7 +116,9 @@ static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
     if (local_err) {
         ret = -EINVAL;
         error_propagate(errp, local_err);
+        goto fail;
     }
+    pcache_state_init(opts, bs->opaque);
 fail:
     qemu_opts_del(opts);
     return ret;
@@ -100,6 +126,9 @@ fail:
 
 static void pcache_close(BlockDriverState *bs)
 {
+    BDRVPCacheState *s = bs->opaque;
+
+    rbcache_destroy(s->req_stats);
 }
 
 static void pcache_parse_filename(const char *filename, QDict *options,
@@ -122,7 +151,7 @@ static bool pcache_recurse_is_first_non_filter(BlockDriverState *bs,
 static BlockDriver bdrv_pcache = {
     .format_name                        = "pcache",
     .protocol_name                      = "pcache",
-    .instance_size                      = 0,
+    .instance_size                      = sizeof(BDRVPCacheState),
 
     .bdrv_parse_filename                = pcache_parse_filename,
     .bdrv_file_open                     = pcache_file_open,
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 07/18] block/pcache: skip large aio read
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (5 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 06/18] block/pcache: statistics collection read requests Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 08/18] block/pcache: updating statistics for overlapping requests Pavel Butsykin
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

This change allows more efficient use of cache memory and filters out the
case for which pcache isn't effective. We skip requests that don't need the
optimization, thereby reducing the number of unnecessary readaheads.

Add the pcache-max-aio-size open parameter.
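
A sketch of the resulting gate in the read path (example sizes assumed,
matching the defaults introduced here):

    if (bytes > s->max_aio_size) {
        /* e.g. a 1 MB guest read: passed through, not recorded */
    } else {
        /* e.g. a 4 KB read: recorded, it may prove to be sequential */
    }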

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/block/pcache.c b/block/pcache.c
index 60b1f93..bfc7e97 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -16,6 +16,7 @@
 #include "qemu/rbcache.h"
 
 #define PCACHE_OPT_STATS_SIZE "pcache-stats-size"
+#define PCACHE_OPT_MAX_AIO_SIZE "pcache-max-aio-size"
 
 static QemuOptsList runtime_opts = {
     .name = "pcache",
@@ -31,15 +32,23 @@ static QemuOptsList runtime_opts = {
             .type = QEMU_OPT_SIZE,
             .help = "Total volume of requests for statistics",
         },
+        {
+            .name = PCACHE_OPT_MAX_AIO_SIZE,
+            .type = QEMU_OPT_SIZE,
+            .help = "Maximum size of aio which is handled by pcache",
+        },
         { /* end of list */ }
     },
 };
 
+#define KB_BITS 10
 #define MB_BITS 20
 #define PCACHE_DEFAULT_STATS_SIZE (3 << MB_BITS)
+#define PCACHE_DEFAULT_MAX_AIO_SIZE (64 << KB_BITS)
 
 typedef struct BDRVPCacheState {
     RBCache *req_stats;
+    uint64_t max_aio_size;
 } BDRVPCacheState;
 
 typedef struct PCacheAIOCB {
@@ -64,7 +73,9 @@ static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
         .co = qemu_coroutine_self(),
     };
 
-    rbcache_search_and_insert(s->req_stats, offset, bytes);
+    if (s->max_aio_size >= bytes) {
+        rbcache_search_and_insert(s->req_stats, offset, bytes);
+    }
 
     bdrv_aio_preadv(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
 
@@ -93,6 +104,9 @@ static void pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
     uint64_t stats_size = qemu_opt_get_size(opts, PCACHE_OPT_STATS_SIZE,
                                             PCACHE_DEFAULT_STATS_SIZE);
     s->req_stats = rbcache_create(NULL, NULL, stats_size, RBCACHE_FIFO, s);
+
+    s->max_aio_size = qemu_opt_get_size(opts, PCACHE_OPT_MAX_AIO_SIZE,
+                                        PCACHE_DEFAULT_MAX_AIO_SIZE);
 }
 
 static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 08/18] block/pcache: updating statistics for overlapping requests
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (6 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 07/18] block/pcache: skip large aio read Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 09/18] block/pcache: add AIO readahead Pavel Butsykin
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

When updating the statistics, I/O requests can sometimes overlap each
other; until now such requests were simply not stored in the statistics.
That is unfortunate, especially when the requests intersect only over a
small range.

We can clip the requests at the intersection and add the uncovered pieces
to the statistics. The gain may not be dramatic, but it makes the
statistics more accurate. Here, update_req_stats() adds the ability to
record overlapping requests.
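
A worked example of the clipping (hypothetical offsets): suppose the tree
already holds a node for [0; 4K) when a read of [2K; 8K) arrives:

    update_req_stats(s->req_stats, 2048, 6144);    /* the [2K; 8K) read */

    /* pass 1: rbcache_search_and_insert() returns the old [0; 4K) node;
     *         it neither covers nor is covered by the request, and the
     *         request starts inside it, so the request is clipped to the
     *         uncovered tail: offset = 4K, bytes = 4K;
     * pass 2: nothing overlaps [4K; 8K), so a new node is inserted and
     *         the loop breaks. The tree now holds [0; 4K) and [4K; 8K). */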

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/block/pcache.c b/block/pcache.c
index bfc7e97..dd598f3 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -64,6 +64,40 @@ static void pcache_aio_cb(void *opaque, int ret)
     qemu_coroutine_enter(acb->co);
 }
 
+static void update_req_stats(RBCache *rbcache, uint64_t offset, uint64_t bytes)
+{
+    do {
+        RBCacheNode *node = rbcache_search_and_insert(rbcache, offset, bytes);
+        /* The request is fully covered (possibly by the just-added node) */
+        if (node->offset <= offset &&
+            node->offset + node->bytes >= offset + bytes)
+        {
+            break;
+        }
+
+        /* Request covers the whole node */
+        if (offset <= node->offset &&
+            offset + bytes >= node->offset + node->bytes)
+        {
+            rbcache_remove(rbcache, node);
+            continue;
+        }
+
+        if (offset < node->offset) {
+            RBCacheNode *new_node =
+                rbcache_node_alloc(rbcache, offset, node->offset - offset);
+            if (new_node != rbcache_insert(rbcache, new_node)) {
+                /* The presence of the node in this range is impossible */
+                abort();
+            }
+            break;
+        }
+
+        bytes = (offset + bytes) - (node->offset + node->bytes);
+        offset = node->offset + node->bytes;
+    } while (true);
+}
+
 static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -74,7 +108,7 @@ static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
     };
 
     if (s->max_aio_size >= bytes) {
-        rbcache_search_and_insert(s->req_stats, offset, bytes);
+        update_req_stats(s->req_stats, offset, bytes);
     }
 
     bdrv_aio_preadv(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 09/18] block/pcache: add AIO readahead
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (7 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 08/18] block/pcache: updating statistics for overlapping requests Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 10/18] block/pcache: skip readahead for unallocated clusters Pavel Butsykin
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

This patch adds readahead of data into the cache. The readahead is a
separate asynchronous request which doesn't depend on the completion of the
filtered read requests. Readahead is performed on one condition only: the
area immediately preceding the current request must contain enough
sequentially read data. The request statistics can provide this
information; of course, this detection method is not fully reliable, but in
most cases it works.
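
Concretely (hypothetical numbers, default sizes): for a read at offset 1M,
the 128K that follow the read are prefetched only if the statistics fully
cover the 128K immediately preceding it:

    /* true only if recorded reads cover all of [1M - 128K; 1M) */
    if (check_request_sequence(s, 1 << 20)) {
        /* a detached coroutine reads the next s->readahead_size bytes
         * into s->cache without delaying the guest's own read */
    }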

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 229 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 227 insertions(+), 2 deletions(-)

diff --git a/block/pcache.c b/block/pcache.c
index dd598f3..3717037 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -17,6 +17,8 @@
 
 #define PCACHE_OPT_STATS_SIZE "pcache-stats-size"
 #define PCACHE_OPT_MAX_AIO_SIZE "pcache-max-aio-size"
+#define PCACHE_OPT_CACHE_SIZE "pcache-full-size"
+#define PCACHE_OPT_READAHEAD_SIZE "pcache-readahead-size"
 
 static QemuOptsList runtime_opts = {
     .name = "pcache",
@@ -37,6 +39,16 @@ static QemuOptsList runtime_opts = {
             .type = QEMU_OPT_SIZE,
             .help = "Maximum size of aio which is handled by pcache",
         },
+        {
+            .name = PCACHE_OPT_CACHE_SIZE,
+            .type = QEMU_OPT_SIZE,
+            .help = "Total cache size",
+        },
+        {
+            .name = PCACHE_OPT_READAHEAD_SIZE,
+            .type = QEMU_OPT_SIZE,
+            .help = "Prefetch cache readahead size",
+        },
         { /* end of list */ }
     },
 };
@@ -45,17 +57,61 @@ static QemuOptsList runtime_opts = {
 #define MB_BITS 20
 #define PCACHE_DEFAULT_STATS_SIZE (3 << MB_BITS)
 #define PCACHE_DEFAULT_MAX_AIO_SIZE (64 << KB_BITS)
+#define PCACHE_DEFAULT_CACHE_SIZE (4 << MB_BITS)
+#define PCACHE_DEFAULT_READAHEAD_SIZE (128 << KB_BITS)
 
 typedef struct BDRVPCacheState {
     RBCache *req_stats;
+    RBCache *cache;
     uint64_t max_aio_size;
+    uint64_t readahead_size;
 } BDRVPCacheState;
 
+typedef struct PCacheNode {
+    RBCacheNode common;
+    uint8_t *data;
+    enum {
+        NODE_STATUS_NEW       = 0,
+        NODE_STATUS_INFLIGHT  = 1,
+        NODE_STATUS_COMPLETED = 2,
+        NODE_STATUS_REMOVE    = 3,
+        NODE_STATUS_DELETED   = 4, /* only for debugging */
+    } status;
+    int ref;
+} PCacheNode;
+
 typedef struct PCacheAIOCB {
+    BlockDriverState *bs;
     Coroutine *co;
+    uint64_t offset;
+    uint64_t bytes;
     int ret;
 } PCacheAIOCB;
 
+typedef struct PCacheAIOCBReadahead {
+    BlockDriverState *bs;
+    Coroutine *co;
+    QEMUIOVector qiov;
+    PCacheNode *node;
+} PCacheAIOCBReadahead;
+
+static inline void pcache_node_ref(PCacheNode *node)
+{
+    node->ref++;
+}
+
+static void pcache_node_unref(PCacheNode *node)
+{
+    assert(node->ref > 0);
+    if (--node->ref == 0) {
+        assert(node->status == NODE_STATUS_REMOVE);
+        node->status = NODE_STATUS_DELETED;
+
+        g_free(node->data);
+        g_free(node);
+    }
+}
+
 static void pcache_aio_cb(void *opaque, int ret)
 {
     PCacheAIOCB *acb = opaque;
@@ -64,6 +120,27 @@ static void pcache_aio_cb(void *opaque, int ret)
     qemu_coroutine_enter(acb->co);
 }
 
+static void pcache_aio_readahead_cb(void *opaque, int ret)
+{
+    PCacheAIOCBReadahead *acb = opaque;
+    PCacheNode *node = acb->node;
+
+    assert(node->status == NODE_STATUS_INFLIGHT ||
+           node->status == NODE_STATUS_REMOVE);
+
+    if (node->status == NODE_STATUS_INFLIGHT) {
+        if (ret == 0) {
+            node->status = NODE_STATUS_COMPLETED;
+        } else {
+            BDRVPCacheState *s = acb->bs->opaque;
+            rbcache_remove(s->cache, &node->common);
+        }
+    }
+    pcache_node_unref(node);
+
+    qemu_coroutine_enter(acb->co);
+}
+
 static void update_req_stats(RBCache *rbcache, uint64_t offset, uint64_t bytes)
 {
     do {
@@ -98,6 +175,138 @@ static void update_req_stats(RBCache *rbcache, uint64_t offset, uint64_t bytes)
     } while (true);
 }
 
+static bool check_request_sequence(BDRVPCacheState *s, uint64_t offset)
+{
+    uint64_t cache_line_size = s->readahead_size;
+    uint64_t check_offset;
+
+    if (offset <= cache_line_size) {
+        return false;
+    }
+    check_offset = offset - cache_line_size;
+
+    do {
+        RBCacheNode *node = rbcache_search(s->req_stats, check_offset,
+                                           offset - check_offset);
+        if (node == NULL) {
+            return false;
+        }
+        if (node->offset > check_offset) {
+            return false;
+        }
+        check_offset = node->offset + node->bytes;
+    } while (check_offset < offset);
+
+    return true;
+}
+
+static void pcache_node_free(RBCacheNode *rbnode, void *opaque)
+{
+    PCacheNode *node = container_of(rbnode, PCacheNode, common);
+
+    assert(node->status == NODE_STATUS_INFLIGHT ||
+           node->status == NODE_STATUS_COMPLETED);
+
+    node->status = NODE_STATUS_REMOVE;
+    pcache_node_unref(node);
+}
+
+static RBCacheNode *pcache_node_alloc(uint64_t offset, uint64_t bytes,
+                                      void *opaque)
+{
+    PCacheNode *node = g_malloc(sizeof(*node));
+
+    node->data = g_malloc(bytes);
+    node->status = NODE_STATUS_NEW;
+    node->ref = 1;
+
+    return &node->common;
+}
+
+#define PCACHE_STEPS_FORWARD 2
+
+static PCacheNode *get_readahead_node(BlockDriverState *bs, RBCache *rbcache,
+                                      uint64_t offset, uint64_t bytes)
+{
+    uint32_t count = PCACHE_STEPS_FORWARD;
+
+    int64_t total_bytes = bdrv_getlength(bs);
+    if (total_bytes < 0) {
+        return NULL;
+    }
+
+    while (count--) {
+        PCacheNode *node;
+
+        if (total_bytes <= offset + bytes) {
+            break;
+        }
+
+        node = rbcache_search_and_insert(rbcache, offset, bytes);
+        if (node->status == NODE_STATUS_NEW) {
+            return node;
+        }
+        /* Ranges smaller than the readahead size are not cached, to reduce
+         * fragmentation of the cache. If the data is already cached, we
+         * just step over it.
+         */
+        if (offset <= node->common.offset && !count--) {
+            break;
+        }
+        offset = node->common.offset + node->common.bytes;
+    }
+
+    return NULL;
+}
+
+static void coroutine_fn pcache_co_readahead(void *opaque)
+{
+    PCacheAIOCB *acb = g_memdup(opaque, sizeof(*acb));
+    BlockDriverState *bs = acb->bs;
+    BDRVPCacheState *s = bs->opaque;
+    uint64_t offset;
+    uint64_t bytes;
+    PCacheAIOCBReadahead readahead_acb;
+    PCacheNode *node;
+
+    if (!check_request_sequence(s, acb->offset)) {
+        goto out;
+    }
+
+    offset = acb->offset + acb->bytes;
+    bytes = s->readahead_size;
+
+    node = get_readahead_node(bs, s->cache, offset, bytes);
+    if (node == NULL) {
+        goto out;
+    }
+
+    readahead_acb = (PCacheAIOCBReadahead) {
+        .co = qemu_coroutine_self(),
+        .bs = bs,
+        .node = node,
+    };
+
+    node->status = NODE_STATUS_INFLIGHT;
+    qemu_iovec_init(&readahead_acb.qiov, 1);
+    qemu_iovec_add(&readahead_acb.qiov, node->data, node->common.bytes);
+
+    pcache_node_ref(node);
+
+    bdrv_aio_preadv(bs->file, node->common.offset, &readahead_acb.qiov,
+                    node->common.bytes, pcache_aio_readahead_cb,
+                    &readahead_acb);
+    qemu_coroutine_yield();
+out:
+    g_free(acb);
+}
+
+static void pcache_readahead_request(PCacheAIOCB *acb)
+{
+    Coroutine *co = qemu_coroutine_create(pcache_co_readahead, acb);
+    qemu_coroutine_enter(co);
+}
+
 static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -105,14 +314,23 @@ static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
     BDRVPCacheState *s = bs->opaque;
     PCacheAIOCB acb = {
         .co = qemu_coroutine_self(),
+        .bs = bs,
+        .offset = offset,
+        .bytes = bytes,
     };
 
-    if (s->max_aio_size >= bytes) {
-        update_req_stats(s->req_stats, offset, bytes);
+    if (bytes > s->max_aio_size) {
+        bdrv_aio_preadv(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
+        goto out;
     }
 
+    update_req_stats(s->req_stats, offset, bytes);
+
     bdrv_aio_preadv(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
 
+    pcache_readahead_request(&acb);
+
+out:
     qemu_coroutine_yield();
 
     return acb.ret;
@@ -137,10 +355,16 @@ static void pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
 {
     uint64_t stats_size = qemu_opt_get_size(opts, PCACHE_OPT_STATS_SIZE,
                                             PCACHE_DEFAULT_STATS_SIZE);
+    uint64_t cache_size = qemu_opt_get_size(opts, PCACHE_OPT_CACHE_SIZE,
+                                            PCACHE_DEFAULT_CACHE_SIZE);
     s->req_stats = rbcache_create(NULL, NULL, stats_size, RBCACHE_FIFO, s);
 
     s->max_aio_size = qemu_opt_get_size(opts, PCACHE_OPT_MAX_AIO_SIZE,
                                         PCACHE_DEFAULT_MAX_AIO_SIZE);
+    s->cache = rbcache_create(pcache_node_alloc, pcache_node_free, cache_size,
+                              RBCACHE_LRU, s);
+    s->readahead_size = qemu_opt_get_size(opts, PCACHE_OPT_READAHEAD_SIZE,
+                                          PCACHE_DEFAULT_READAHEAD_SIZE);
 }
 
 static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
@@ -177,6 +401,7 @@ static void pcache_close(BlockDriverState *bs)
     BDRVPCacheState *s = bs->opaque;
 
     rbcache_destroy(s->req_stats);
+    rbcache_destroy(s->cache);
 }
 
 static void pcache_parse_filename(const char *filename, QDict *options,
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 10/18] block/pcache: skip readahead for unallocated clusters
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (8 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 09/18] block/pcache: add AIO readahead Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 11/18] block/pcache: cache invalidation on AIO write requests Pavel Butsykin
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

Typically, data for unallocated clusters is filled with zeros, so it makes no
sense to store it in the cache.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/block/pcache.c b/block/pcache.c
index 3717037..1f3800d 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -223,6 +223,28 @@ static RBCacheNode *pcache_node_alloc(uint64_t offset, uint64_t bytes,
     return &node->common;
 }
 
+static bool check_allocated_clusters(BlockDriverState *bs, uint64_t offset,
+                                     uint64_t bytes)
+{
+    int64_t sector_num = offset >> BDRV_SECTOR_BITS;
+    int32_t nb_sectors = bytes >> BDRV_SECTOR_BITS;
+
+    assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+    assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+
+    do {
+        int num, ret = bdrv_is_allocated(bs, sector_num, nb_sectors, &num);
+        if (ret <= 0) {
+            return false;
+        }
+        sector_num += num;
+        nb_sectors -= num;
+
+    } while (nb_sectors);
+
+    return true;
+}
+
 #define PCACHE_STEPS_FORWARD 2
 
 static PCacheNode *get_readahead_node(BlockDriverState *bs, RBCache *rbcache,
@@ -242,6 +264,10 @@ static PCacheNode *get_readahead_node(BlockDriverState *bs, RBCache *rbcache,
             break;
         }
 
+        if (!check_allocated_clusters(bs, offset, bytes)) {
+            break;
+        }
+
         node = rbcache_search_and_insert(rbcache, offset, bytes);
         if (node->status == NODE_STATUS_NEW) {
             return node;
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 11/18] block/pcache: cache invalidation on AIO write requests
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (9 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 10/18] block/pcache: skip readahead for unallocated clusters Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 12/18] block/pcache: add reading data from the cache Pavel Butsykin
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

On completion of an AIO write request we simply drop all intersecting
nodes from the cache; it's a simple way to keep the cache up-to-date.
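
A walkthrough with hypothetical ranges: say the cache holds completed nodes
[0; 64K) and [64K; 128K) when a write to [32K; 96K) completes:

    /* pass 1: rbcache_search() finds [0; 64K); it is COMPLETED, so it is
     *         dropped, and the scan window shrinks to [64K; 96K);
     * pass 2: [64K; 128K) is found and dropped; the window is now empty
     *         and the loop ends.
     * Nodes still NODE_STATUS_INFLIGHT are stepped over rather than
     * removed here. */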

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/block/pcache.c b/block/pcache.c
index 1f3800d..27ee6dd 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -362,15 +362,52 @@ out:
     return acb.ret;
 }
 
+static void pcache_aio_write_cb(void *opaque, int ret)
+{
+    PCacheAIOCB *acb = opaque;
+    BDRVPCacheState *s = acb->bs->opaque;
+    uint64_t offset = acb->offset;
+    uint64_t bytes = acb->bytes;
+    uint64_t end_offs = offset + bytes;
+
+    if (ret < 0) {
+        goto out;
+    }
+
+    do {
+        PCacheNode *node = rbcache_search(s->cache, offset, bytes);
+        if (node == NULL) {
+            break;
+        }
+        assert(node->status == NODE_STATUS_COMPLETED ||
+               node->status == NODE_STATUS_INFLIGHT  ||
+               node->status == NODE_STATUS_REMOVE);
+
+        offset = node->common.offset + node->common.bytes;
+        bytes = end_offs - offset;
+
+        if (node->status == NODE_STATUS_COMPLETED) {
+            rbcache_remove(s->cache, &node->common);
+        }
+    } while (end_offs > offset);
+
+out:
+    acb->ret = ret;
+    qemu_coroutine_enter(acb->co);
+}
+
 static coroutine_fn int pcache_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                           uint64_t bytes, QEMUIOVector *qiov,
                                           int flags)
 {
     PCacheAIOCB acb = {
         .co = qemu_coroutine_self(),
+        .bs = bs,
+        .offset = offset,
+        .bytes = bytes,
     };
 
-    bdrv_aio_pwritev(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
+    bdrv_aio_pwritev(bs->file, offset, qiov, bytes, pcache_aio_write_cb, &acb);
 
     qemu_coroutine_yield();
 
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 12/18] block/pcache: add reading data from the cache
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (10 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 11/18] block/pcache: cache invalidation on AIO write requests Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 13/18] block/pcache: inflight readahead request waiting for aio read Pavel Butsykin
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

Add read_cache_data(), which copies data from a node into a qiov, along
with the simplest use case: reading data from a node, provided that the
node is not in flight and fully covers the read request.
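
For instance (hypothetical ranges), a 4K read at offset 1K served from a
node covering [0; 8K):

    /* size = ranges_overlap_size(1K, 4K, 0, 8K)
     *      = MIN(1K + 4K, 0 + 8K) - MAX(1K, 0) = 4K;
     * node_offs = 1K, qiov_offs = 0: all 4K of the request are copied
     * from node->data + 1K straight into the caller's qiov */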

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 80 insertions(+), 11 deletions(-)

diff --git a/block/pcache.c b/block/pcache.c
index 27ee6dd..0f918d0 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -80,13 +80,22 @@ typedef struct PCacheNode {
     int ref;
 } PCacheNode;
 
-typedef struct PCacheAIOCB {
+typedef struct PCacheAIOCBWrite {
     BlockDriverState *bs;
     Coroutine *co;
     uint64_t offset;
     uint64_t bytes;
     int ret;
-} PCacheAIOCB;
+} PCacheAIOCBWrite;
+
+typedef struct PCacheAIOCBRead {
+    BlockDriverState *bs;
+    Coroutine *co;
+    uint64_t offset;
+    uint64_t bytes;
+    QEMUIOVector *qiov;
+    int ret;
+} PCacheAIOCBRead;
 
 typedef struct PCacheAIOCBReadahead {
     BlockDriverState *bs;
@@ -112,9 +121,35 @@ static void pcache_node_unref(PCacheNode *node)
     }
 }
 
-static void pcache_aio_cb(void *opaque, int ret)
+static uint64_t ranges_overlap_size(uint64_t offset1, uint64_t size1,
+                                    uint64_t offset2, uint64_t size2)
+{
+    return MIN(offset1 + size1, offset2 + size2) - MAX(offset1, offset2);
+}
+
+static void read_cache_data(PCacheAIOCBRead *acb, PCacheNode *node,
+                            uint64_t offset, uint64_t bytes)
+{
+    uint64_t qiov_offs = 0, node_offs = 0;
+    uint64_t size;
+    uint64_t copy;
+
+    if (offset < node->common.offset) {
+        qiov_offs = node->common.offset - offset;
+    } else {
+        node_offs = offset - node->common.offset;
+    }
+    size = ranges_overlap_size(offset, bytes, node->common.offset,
+                               node->common.bytes);
+
+    copy = qemu_iovec_from_buf(acb->qiov, qiov_offs,
+                               node->data + node_offs, size);
+    assert(copy == size);
+}
+
+static void pcache_aio_read_cb(void *opaque, int ret)
 {
-    PCacheAIOCB *acb = opaque;
+    PCacheAIOCBRead *acb = opaque;
 
     acb->ret = ret;
     qemu_coroutine_enter(acb->co);
@@ -287,7 +322,7 @@ static PCacheNode *get_readahead_node(BlockDriverState *bs, RBCache *rbcache,
 
 static void coroutine_fn pcache_co_readahead(void *opaque)
 {
-    PCacheAIOCB *acb = g_memdup(opaque, sizeof(*acb));
+    PCacheAIOCBRead *acb = g_memdup(opaque, sizeof(*acb));
     BlockDriverState *bs = acb->bs;
     BDRVPCacheState *s = bs->opaque;
     uint64_t offset;
@@ -327,32 +362,66 @@ out:
     g_free(acb);
 }
 
-static void pcache_readahead_request(PCacheAIOCB *acb)
+static void pcache_readahead_request(PCacheAIOCBRead *acb)
 {
     Coroutine *co = qemu_coroutine_create(pcache_co_readahead, acb);
     qemu_coroutine_enter(co);
 }
 
+enum {
+    CACHE_MISS,
+    CACHE_HIT,
+    PARTIAL_CACHE_HIT,
+};
+
+static int pcache_lookup_data(PCacheAIOCBRead *acb)
+{
+    BDRVPCacheState *s = acb->bs->opaque;
+
+    PCacheNode *node = rbcache_search(s->cache, acb->offset, acb->bytes);
+    if (node == NULL || node->status != NODE_STATUS_COMPLETED) {
+        return CACHE_MISS;
+    }
+
+    /* Node covers the whole request */
+    if (node->common.offset <= acb->offset &&
+        node->common.offset + node->common.bytes >= acb->offset + acb->bytes)
+    {
+        read_cache_data(acb, node, acb->offset, acb->bytes);
+        return CACHE_HIT;
+    }
+
+    return PARTIAL_CACHE_HIT;
+}
+
 static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
 {
     BDRVPCacheState *s = bs->opaque;
-    PCacheAIOCB acb = {
+    PCacheAIOCBRead acb = {
         .co = qemu_coroutine_self(),
         .bs = bs,
         .offset = offset,
         .bytes = bytes,
+        .qiov = qiov,
     };
+    int status;
 
     if (bytes > s->max_aio_size) {
-        bdrv_aio_preadv(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
+        bdrv_aio_preadv(bs->file, offset, qiov, bytes,
+                        pcache_aio_read_cb, &acb);
         goto out;
     }
 
     update_req_stats(s->req_stats, offset, bytes);
 
-    bdrv_aio_preadv(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
+    status = pcache_lookup_data(&acb);
+    if (status == CACHE_HIT) {
+        return 0;
+    }
+
+    bdrv_aio_preadv(bs->file, offset, qiov, bytes, pcache_aio_read_cb, &acb);
 
     pcache_readahead_request(&acb);
 
@@ -364,7 +433,7 @@ out:
 
 static void pcache_aio_write_cb(void *opaque, int ret)
 {
-    PCacheAIOCB *acb = opaque;
+    PCacheAIOCBWrite *acb = opaque;
     BDRVPCacheState *s = acb->bs->opaque;
     uint64_t offset = acb->offset;
     uint64_t bytes = acb->bytes;
@@ -400,7 +469,7 @@ static coroutine_fn int pcache_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                           uint64_t bytes, QEMUIOVector *qiov,
                                           int flags)
 {
-    PCacheAIOCB acb = {
+    PCacheAIOCBWrite acb = {
         .co = qemu_coroutine_self(),
         .bs = bs,
         .offset = offset,
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 13/18] block/pcache: inflight readahead request waiting for aio read
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (11 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 12/18] block/pcache: add reading data from the cache Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 14/18] backup/pcache: pick up parts of the cache Pavel Butsykin
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

If there is a cache hit but the node status is NODE_STATUS_INFLIGHT, a
readahead request for this node is still in flight. In that case we can
wait for the readahead request to complete and then copy the data. This
lets us optimize AIO read requests even further.
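
The acb reference count keeps the completion ordering straight; a sketch of
the flow for a read that hits an in-flight node:

    /* pcache_co_preadv():        acb.ref = 1
     * pcache_aio_read():         node is INFLIGHT, so the acb is linked
     *                            into node->wait_list, acb.ref = 2
     * after readahead dispatch:  --acb.ref == 1, so the coroutine yields
     *                            instead of returning early
     * pcache_aio_readahead_cb(): read_cache_data() fills the qiov, then
     *                            aio_read_complete() drops acb.ref to 0
     *                            and re-enters the read coroutine */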

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 65 insertions(+), 8 deletions(-)

diff --git a/block/pcache.c b/block/pcache.c
index 0f918d0..55d1061 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -67,9 +67,15 @@ typedef struct BDRVPCacheState {
     uint64_t readahead_size;
 } BDRVPCacheState;
 
+typedef struct ACBLinkEntry {
+    QTAILQ_ENTRY(ACBLinkEntry) entry;
+    struct PCacheAIOCBRead     *acb;
+} ACBLinkEntry;
+
 typedef struct PCacheNode {
     RBCacheNode common;
     uint8_t *data;
+    QTAILQ_HEAD(, ACBLinkEntry) wait_list;
     enum {
         NODE_STATUS_NEW       = 0,
         NODE_STATUS_INFLIGHT  = 1,
@@ -94,6 +100,7 @@ typedef struct PCacheAIOCBRead {
     uint64_t offset;
     uint64_t bytes;
     QEMUIOVector *qiov;
+    int ref;
     int ret;
 } PCacheAIOCBRead;
 
@@ -147,18 +154,49 @@ static void read_cache_data(PCacheAIOCBRead *acb, PCacheNode *node,
     assert(copy == size);
 }
 
+static void pcache_aio_read(PCacheAIOCBRead *acb, PCacheNode *node)
+{
+    ACBLinkEntry *link;
+
+    if (node->status == NODE_STATUS_COMPLETED) {
+        read_cache_data(acb, node, acb->offset, acb->bytes);
+        return;
+    }
+
+    assert(node->status == NODE_STATUS_INFLIGHT ||
+           node->status == NODE_STATUS_REMOVE);
+
+    link = g_malloc(sizeof(*link));
+    link->acb = acb;
+
+    QTAILQ_INSERT_HEAD(&node->wait_list, link, entry);
+    acb->ref++;
+}
+
+static void aio_read_complete(PCacheAIOCBRead *acb, int ret)
+{
+    if (ret < 0 && acb->ret == 0) {
+        acb->ret = ret;
+    }
+
+    assert(acb->ref > 0);
+    if (--acb->ref == 0) {
+        qemu_coroutine_enter(acb->co);
+    }
+}
+
 static void pcache_aio_read_cb(void *opaque, int ret)
 {
     PCacheAIOCBRead *acb = opaque;
 
-    acb->ret = ret;
-    qemu_coroutine_enter(acb->co);
+    aio_read_complete(acb, ret);
 }
 
 static void pcache_aio_readahead_cb(void *opaque, int ret)
 {
     PCacheAIOCBReadahead *acb = opaque;
     PCacheNode *node = acb->node;
+    ACBLinkEntry *link, *next;
 
     assert(node->status == NODE_STATUS_INFLIGHT ||
            node->status == NODE_STATUS_REMOVE);
@@ -171,6 +209,20 @@ static void pcache_aio_readahead_cb(void *opaque, int ret)
             rbcache_remove(s->cache, &node->common);
         }
     }
+
+    QTAILQ_FOREACH_SAFE(link, &node->wait_list, entry, next) {
+        PCacheAIOCBRead *acb_read = link->acb;
+
+        QTAILQ_REMOVE(&node->wait_list, link, entry);
+        g_free(link);
+
+        if (ret == 0) {
+            read_cache_data(acb_read, node, acb_read->offset, acb_read->bytes);
+        }
+
+        aio_read_complete(acb_read, ret);
+    }
+
     pcache_node_unref(node);
 
     qemu_coroutine_enter(acb->co);
@@ -254,6 +306,7 @@ static RBCacheNode *pcache_node_alloc(uint64_t offset, uint64_t bytes,
     node->data = g_malloc(bytes);
     node->status = NODE_STATUS_NEW;
     node->ref = 1;
+    QTAILQ_INIT(&node->wait_list);
 
     return &node->common;
 }
@@ -379,7 +432,7 @@ static int pcache_lookup_data(PCacheAIOCBRead *acb)
     BDRVPCacheState *s = acb->bs->opaque;
 
     PCacheNode *node = rbcache_search(s->cache, acb->offset, acb->bytes);
-    if (node == NULL || node->status != NODE_STATUS_COMPLETED) {
+    if (node == NULL) {
         return CACHE_MISS;
     }
 
@@ -387,7 +440,7 @@ static int pcache_lookup_data(PCacheAIOCBRead *acb)
     if (node->common.offset <= acb->offset &&
         node->common.offset + node->common.bytes >= acb->offset + acb->bytes)
     {
-        read_cache_data(acb, node, acb->offset, acb->bytes);
+        pcache_aio_read(acb, node);
         return CACHE_HIT;
     }
 
@@ -405,6 +458,7 @@ static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
         .offset = offset,
         .bytes = bytes,
         .qiov = qiov,
+        .ref = 1,
     };
     int status;
 
@@ -417,14 +471,17 @@ static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
     update_req_stats(s->req_stats, offset, bytes);
 
     status = pcache_lookup_data(&acb);
-    if (status == CACHE_HIT) {
-        return 0;
+    if (status == CACHE_MISS || status == PARTIAL_CACHE_HIT) {
+        bdrv_aio_preadv(bs->file, offset, qiov, bytes,
+                        pcache_aio_read_cb, &acb);
     }
 
-    bdrv_aio_preadv(bs->file, offset, qiov, bytes, pcache_aio_read_cb, &acb);
-
     pcache_readahead_request(&acb);
 
+    if (status == CACHE_HIT && --acb.ref == 0) {
+        return 0;
+    }
+
 out:
     qemu_coroutine_yield();
 
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 14/18] backup/pcache: pick up parts of the cache
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (12 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 13/18] block/pcache: inflight readahead request waiting for aio read Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 15/18] block/pcache: drop used pcache nodes Pavel Butsykin
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

Provided that the request size is less than the readahead size, a partial
cache hit can occur in the following three cases (sketched below):
1. The request covers the lower part of the node
2. The request covers the upper part of the node
3. The request falls between two nodes and partially covers both of them

The function pickup_parts_of_cache() handles all three cases and issues a
new request for the uncached part of the original request.
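
Schematically ([node] is a cached range, [req] the read request):

    1:        [---node---]
          [---req---]

    2:   [---node---]
               [---req---]

    3:   [--node A--]      [--node B--]
                 [---------req--------]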

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/block/pcache.c b/block/pcache.c
index 55d1061..40fa3f3 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -100,6 +100,11 @@ typedef struct PCacheAIOCBRead {
     uint64_t offset;
     uint64_t bytes;
     QEMUIOVector *qiov;
+    struct {
+        QEMUIOVector qiov;
+        uint64_t offset;
+        uint64_t bytes;
+    } part;
     int ref;
     int ret;
 } PCacheAIOCBRead;
@@ -189,6 +194,10 @@ static void pcache_aio_read_cb(void *opaque, int ret)
 {
     PCacheAIOCBRead *acb = opaque;
 
+    if (acb->part.qiov.niov != 0) {
+        qemu_iovec_destroy(&acb->part.qiov);
+    }
+
     aio_read_complete(acb, ret);
 }
 
@@ -421,6 +430,44 @@ static void pcache_readahead_request(PCacheAIOCBRead *acb)
     qemu_coroutine_enter(co);
 }
 
+static void set_part_req(PCacheAIOCBRead *acb, uint64_t offset, uint64_t bytes)
+{
+    acb->part.offset = offset;
+    acb->part.bytes = bytes;
+
+    assert(acb->part.qiov.niov == 0);
+
+    qemu_iovec_init(&acb->part.qiov, acb->qiov->niov);
+    qemu_iovec_concat(&acb->part.qiov, acb->qiov, offset - acb->offset, bytes);
+}
+
+static void pickup_parts_of_cache(PCacheAIOCBRead *acb, PCacheNode *node,
+                                  uint64_t offset, uint64_t bytes)
+{
+    BDRVPCacheState *s = acb->bs->opaque;
+    uint64_t end_offs = offset + bytes;
+
+    do {
+        if (offset < node->common.offset) {
+            set_part_req(acb, offset, node->common.offset - offset);
+        }
+
+        pcache_aio_read(acb, node);
+
+        offset = node->common.offset + node->common.bytes;
+        if (end_offs <= offset) {
+            break;
+        }
+
+        bytes = end_offs - offset;
+        node = rbcache_search(s->cache, offset, bytes);
+        if (node == NULL) {
+            set_part_req(acb, offset, bytes);
+            break;
+        }
+    } while (true);
+}
+
 enum {
     CACHE_MISS,
     CACHE_HIT,
@@ -444,7 +491,9 @@ static int pcache_lookup_data(PCacheAIOCBRead *acb)
         return CACHE_HIT;
     }
 
-    return PARTIAL_CACHE_HIT;
+    pickup_parts_of_cache(acb, node, acb->offset, acb->bytes);
+
+    return acb->part.qiov.niov == 0 ? CACHE_HIT : PARTIAL_CACHE_HIT;
 }
 
 static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
@@ -471,9 +520,13 @@ static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
     update_req_stats(s->req_stats, offset, bytes);
 
     status = pcache_lookup_data(&acb);
-    if (status == CACHE_MISS || status == PARTIAL_CACHE_HIT) {
+    if (status == CACHE_MISS) {
         bdrv_aio_preadv(bs->file, offset, qiov, bytes,
                         pcache_aio_read_cb, &acb);
+    } else if (status == PARTIAL_CACHE_HIT) {
+        assert(acb.part.qiov.niov != 0);
+        bdrv_aio_preadv(bs->file, acb.part.offset, &acb.part.qiov,
+                        acb.part.bytes, pcache_aio_read_cb, &acb);
     }
 
     pcache_readahead_request(&acb);
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 15/18] block/pcache: drop used pcache nodes
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (13 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 14/18] backup/pcache: pick up parts of the cache Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 16/18] block/pcache: write through Pavel Butsykin
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

The pcache is targeted at a specific situation: sequential reads. This
makes it possible to drop parts of the cache that have already been read,
which reduces the cache size and the number of evicted nodes.
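
In short, the logic added below is (sketch):

    node->rdcnt += size;    /* bytes served from this node so far */
    if (node->rdcnt >= node->common.bytes &&
        node->status == NODE_STATUS_COMPLETED)
    {
        rbcache_remove(s->cache, &node->common);    /* fully read, drop it */
    }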

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/block/pcache.c b/block/pcache.c
index 40fa3f3..df4834a 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -83,6 +83,7 @@ typedef struct PCacheNode {
         NODE_STATUS_REMOVE    = 3,
         NODE_STATUS_DELETED   = 4, /* only for debugging */
     } status;
+    uint64_t rdcnt;
     int ref;
 } PCacheNode;
 
@@ -142,10 +143,14 @@ static uint64_t ranges_overlap_size(uint64_t offset1, uint64_t size1,
 static void read_cache_data(PCacheAIOCBRead *acb, PCacheNode *node,
                             uint64_t offset, uint64_t bytes)
 {
+    BDRVPCacheState *s = acb->bs->opaque;
     uint64_t qiov_offs = 0, node_offs = 0;
     uint64_t size;
     uint64_t copy;
 
+    assert(node->status == NODE_STATUS_COMPLETED ||
+           node->status == NODE_STATUS_REMOVE);
+
     if (offset < node->common.offset) {
         qiov_offs = node->common.offset - offset;
     } else {
@@ -156,6 +161,12 @@ static void read_cache_data(PCacheAIOCBRead *acb, PCacheNode *node,
 
     copy = qemu_iovec_from_buf(acb->qiov, qiov_offs,
                                node->data + node_offs, size);
+    node->rdcnt += size;
+    if (node->rdcnt >= node->common.bytes &&
+        node->status == NODE_STATUS_COMPLETED)
+    {
+        rbcache_remove(s->cache, &node->common);
+    }
     assert(copy == size);
 }
 
@@ -314,6 +325,7 @@ static RBCacheNode *pcache_node_alloc(uint64_t offset, uint64_t bytes,
 
     node->data = g_malloc(bytes);
     node->status = NODE_STATUS_NEW;
+    node->rdcnt = 0;
     node->ref = 1;
     QTAILQ_INIT(&node->wait_list);
 
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 16/18] block/pcache: write through
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (14 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 15/18] block/pcache: drop used pcache nodes Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 17/18] block/pcache: add tracepoints Pavel Butsykin
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

Write-through is another way to keep the cache up to date. Even if this
case is rare, writing through to the cache is cheaper than dropping the
cached node.
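
The change boils down to the following (sketch of the hunk below): in the
write completion callback, a completed node overlapping the write is now
updated in place instead of being dropped:

    if (node->status == NODE_STATUS_COMPLETED) {
        /* was: rbcache_remove(s->cache, &node->common); */
        write_cache_data(acb->qiov, node, acb->offset, acb->bytes);
    }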

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/pcache.c b/block/pcache.c
index df4834a..a592ea0 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -92,6 +92,7 @@ typedef struct PCacheAIOCBWrite {
     Coroutine *co;
     uint64_t offset;
     uint64_t bytes;
+    QEMUIOVector *qiov;
     int ret;
 } PCacheAIOCBWrite;
 
@@ -553,6 +554,24 @@ out:
     return acb.ret;
 }
 
+static void write_cache_data(QEMUIOVector *qiov, PCacheNode *node,
+                             uint64_t offset, uint64_t bytes)
+{
+    uint64_t qiov_offs = 0, node_offs = 0;
+    uint64_t size;
+    uint64_t copy;
+
+    if (offset < node->common.offset) {
+        qiov_offs = node->common.offset - offset;
+    } else {
+        node_offs = offset - node->common.offset;
+    }
+    size = ranges_overlap_size(offset, bytes, node->common.offset,
+                               node->common.bytes);
+    copy = qemu_iovec_to_buf(qiov, qiov_offs, node->data + node_offs, size);
+    assert(copy == size);
+}
+
 static void pcache_aio_write_cb(void *opaque, int ret)
 {
     PCacheAIOCBWrite *acb = opaque;
@@ -578,7 +597,7 @@ static void pcache_aio_write_cb(void *opaque, int ret)
         bytes = end_offs - offset;
 
         if (node->status == NODE_STATUS_COMPLETED) {
-            rbcache_remove(s->cache, &node->common);
+            write_cache_data(acb->qiov, node, acb->offset, acb->bytes);
         }
     } while (end_offs > offset);
 
@@ -596,6 +615,7 @@ static coroutine_fn int pcache_co_pwritev(BlockDriverState *bs, uint64_t offset,
         .bs = bs,
         .offset = offset,
         .bytes = bytes,
+        .qiov = qiov,
     };
 
     bdrv_aio_pwritev(bs->file, offset, qiov, bytes, pcache_aio_write_cb, &acb);
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 17/18] block/pcache: add tracepoints
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (15 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 16/18] block/pcache: write through Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 18/18] block/pcache: debug build Pavel Butsykin
  2016-11-15 16:18 ` [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache no-reply
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c     | 21 +++++++++++++++++++++
 block/trace-events |  9 +++++++++
 2 files changed, 30 insertions(+)

diff --git a/block/pcache.c b/block/pcache.c
index a592ea0..1821557 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -14,6 +14,7 @@
 #include "qapi/error.h"
 #include "qapi/qmp/qstring.h"
 #include "qemu/rbcache.h"
+#include "trace.h"
 
 #define PCACHE_OPT_STATS_SIZE "pcache-stats-size"
 #define PCACHE_OPT_MAX_AIO_SIZE "pcache-max-aio-size"
@@ -206,7 +207,15 @@ static void pcache_aio_read_cb(void *opaque, int ret)
 {
     PCacheAIOCBRead *acb = opaque;
 
+    if (ret < 0) {
+        trace_pcache_aio_read_cb_fail(ret, acb->offset, acb->bytes);
+    }
+
     if (acb->part.qiov.niov != 0) {
+        if (ret < 0) {
+            trace_pcache_aio_read_cb_part_fail(ret, acb->part.offset,
+                                               acb->part.bytes);
+        }
         qemu_iovec_destroy(&acb->part.qiov);
     }
 
@@ -227,6 +236,8 @@ static void pcache_aio_readahead_cb(void *opaque, int ret)
             node->status = NODE_STATUS_COMPLETED;
         } else {
             BDRVPCacheState *s = acb->bs->opaque;
+            trace_pcache_aio_readahead_cb_fail(ret, node->common.offset,
+                                               node->common.bytes);
             rbcache_remove(s->cache, &node->common);
         }
     }
@@ -240,6 +251,9 @@ static void pcache_aio_readahead_cb(void *opaque, int ret)
         if (ret == 0) {
             read_cache_data(acb_read, node, acb_read->offset, acb_read->bytes);
         }
+        trace_pcache_aio_readahead_cb_read_complete(
+            ret, node->common.offset, node->common.bytes,
+            acb_read->offset, acb_read->bytes);
 
         aio_read_complete(acb_read, ret);
     }
@@ -598,6 +612,8 @@ static void pcache_aio_write_cb(void *opaque, int ret)
 
         if (node->status == NODE_STATUS_COMPLETED) {
             write_cache_data(acb->qiov, node, acb->offset, acb->bytes);
+            trace_pcache_aio_write_cb_through(acb->offset, acb->bytes,
+                node->common.offset, node->common.bytes);
         }
     } while (end_offs > offset);
 
@@ -639,6 +655,9 @@ static void pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
                               RBCACHE_LRU, s);
     s->readahead_size = qemu_opt_get_size(opts, PCACHE_OPT_READAHEAD_SIZE,
                                           PCACHE_DEFAULT_READAHEAD_SIZE);
+
+    trace_pcache_state_init(stats_size, s->max_aio_size, cache_size,
+                            s->readahead_size);
 }
 
 static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
@@ -674,6 +693,8 @@ static void pcache_close(BlockDriverState *bs)
 {
     BDRVPCacheState *s = bs->opaque;
 
+    trace_pcache_close(s->req_stats, s->cache);
+
     rbcache_destroy(s->req_stats);
     rbcache_destroy(s->cache);
 }
diff --git a/block/trace-events b/block/trace-events
index 05fa13c..30223da 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -114,3 +114,12 @@ qed_aio_write_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s
 qed_aio_write_prefill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
 qed_aio_write_postfill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
 qed_aio_write_main(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
+
+# block/pcache.c
+pcache_aio_readahead_cb_fail(int ret, uint64_t offset, uint64_t bytes) "ret: %d offset: %"PRIu64" bytes: %"PRIu64
+pcache_aio_readahead_cb_read_complete(int ret, uint64_t node_offset, uint64_t node_bytes, uint64_t read_offset, uint64_t read_bytes) "ret: %d node: %"PRIu64" %"PRIu64" pending read: %"PRIu64" %"PRIu64
+pcache_aio_read_cb_fail(int ret, uint64_t offset, uint64_t bytes) "ret: %d offset: %"PRIu64" bytes: %"PRIu64
+pcache_aio_read_cb_part_fail(int ret, uint64_t offset, uint64_t bytes) "ret: %d offset: %"PRIu64" bytes: %"PRIu64
+pcache_aio_write_cb_through(uint64_t req_offset, uint64_t req_bytes, uint64_t node_offset, uint64_t node_bytes) "request: %"PRIu64" %"PRIu64" node: %"PRIu64" %"PRIu64
+pcache_state_init(uint64_t stats_size, uint64_t max_aio_size, uint64_t cache_size, uint64_t readahead_size) "pool statistics size: %"PRIu64" max aio size: %"PRIu64" cache size: %"PRIu64" readahead size: %"PRIu64
+pcache_close(void *req_stats, void *cache) "pool statistics: %p cache: %p"
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH v1 18/18] block/pcache: debug build
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (16 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 17/18] block/pcache: add tracepoints Pavel Butsykin
@ 2016-11-15  6:37 ` Pavel Butsykin
  2016-11-15 16:18 ` [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache no-reply
  18 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-15  6:37 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: den, famz, stefanha, mreitz, kwolf, eblake

This debug patch helps to identify cache memory leaks.
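
The idea (sketch): every node removed from the cache is parked on a "death
list" until its last reference is dropped, and pcache_close() reports
anything still on that list as a leaked node:

    #ifdef PCACHE_DEBUG
        /* on removal from the cache (pcache_node_free) */
        QTAILQ_INSERT_HEAD(&s->death_node_list, rbnode, entry);

        /* on the last unref (pcache_node_unref) */
        QTAILQ_REMOVE(&s->death_node_list, &node->common, entry);
    #endif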

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 36 +++++++++++++++++++++++++++++++++---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/block/pcache.c b/block/pcache.c
index 1821557..7c3a71b 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -16,6 +16,8 @@
 #include "qemu/rbcache.h"
 #include "trace.h"
 
+// #define PCACHE_DEBUG
+
 #define PCACHE_OPT_STATS_SIZE "pcache-stats-size"
 #define PCACHE_OPT_MAX_AIO_SIZE "pcache-max-aio-size"
 #define PCACHE_OPT_CACHE_SIZE "pcache-full-size"
@@ -66,6 +68,9 @@ typedef struct BDRVPCacheState {
     RBCache *cache;
     uint64_t max_aio_size;
     uint64_t readahead_size;
+#ifdef PCACHE_DEBUG
+    QTAILQ_HEAD(PCacheDeathNodeHead, RBCacheNode) death_node_list;
+#endif
 } BDRVPCacheState;
 
 typedef struct ACBLinkEntry {
@@ -124,13 +129,16 @@ static inline void pcache_node_ref(PCacheNode *node)
     node->ref++;
 }
 
-static void pcache_node_unref(PCacheNode *node)
+static void pcache_node_unref(PCacheNode *node, BDRVPCacheState *s)
 {
     assert(node->ref > 0);
     if (--node->ref == 0) {
         assert(node->status == NODE_STATUS_REMOVE);
         node->status = NODE_STATUS_DELETED;
 
+#ifdef PCACHE_DEBUG
+        QTAILQ_REMOVE(&s->death_node_list, &node->common, entry);
+#endif
         g_free(node->data);
         g_free(node);
     }
@@ -258,7 +266,7 @@ static void pcache_aio_readahead_cb(void *opaque, int ret)
         aio_read_complete(acb_read, ret);
     }
 
-    pcache_node_unref(node);
+    pcache_node_unref(node, acb->bs->opaque);
 
     qemu_coroutine_enter(acb->co);
 }
@@ -330,7 +338,11 @@ static void pcache_node_free(RBCacheNode *rbnode, void *opaque)
            node->status == NODE_STATUS_COMPLETED);
 
     node->status = NODE_STATUS_REMOVE;
-    pcache_node_unref(node);
+#ifdef PCACHE_DEBUG
+    BDRVPCacheState *s = opaque;
+    QTAILQ_INSERT_HEAD(&s->death_node_list, rbnode, entry);
+#endif
+    pcache_node_unref(node, opaque);
 }
 
 static RBCacheNode *pcache_node_alloc(uint64_t offset, uint64_t bytes,
@@ -658,6 +670,9 @@ static void pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
 
     trace_pcache_state_init(stats_size, s->max_aio_size, cache_size,
                             s->readahead_size);
+#ifdef PCACHE_DEBUG
+    QTAILQ_INIT(&s->death_node_list);
+#endif
 }
 
 static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
@@ -697,6 +712,21 @@ static void pcache_close(BlockDriverState *bs)
 
     rbcache_destroy(s->req_stats);
     rbcache_destroy(s->cache);
+
+#ifdef PCACHE_DEBUG
+    RBCacheNode *rbnode, *next;
+    uint64_t cnt = 0;
+    QTAILQ_FOREACH_SAFE(rbnode, &s->death_node_list, entry, next) {
+        PCacheNode *node = container_of(rbnode, PCacheNode, common);
+        QTAILQ_REMOVE(&s->death_node_list, rbnode, entry);
+                fprintf(stderr, "death node: %jd - %jd, ref: %d, status: %d\n",
+                rbnode->offset, rbnode->bytes, node->ref, node->status);
+        g_free(node->data);
+        g_free(node);
+        cnt++;
+    }
+    fprintf(stderr, "death node list: %jd\n", cnt);
+#endif
 }
 
 static void pcache_parse_filename(const char *filename, QDict *options,
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache
  2016-11-15  6:36 [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache Pavel Butsykin
                   ` (17 preceding siblings ...)
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 18/18] block/pcache: debug build Pavel Butsykin
@ 2016-11-15 16:18 ` no-reply
  18 siblings, 0 replies; 32+ messages in thread
From: no-reply @ 2016-11-15 16:18 UTC (permalink / raw)
  To: pbutsykin; +Cc: famz, qemu-devel, qemu-block, kwolf

Hi,

Your series seems to have some coding style problems. See output below for
more information:

Type: series
Subject: [Qemu-devel] [PATCH v1 00/18] I/O prefetch cache
Message-id: 20161115063715.12561-1-pbutsykin@virtuozzo.com

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/1479224556-19367-1-git-send-email-stefanha@redhat.com -> patchew/1479224556-19367-1-git-send-email-stefanha@redhat.com
 * [new tag]         patchew/20161115063715.12561-1-pbutsykin@virtuozzo.com -> patchew/20161115063715.12561-1-pbutsykin@virtuozzo.com
Switched to a new branch 'test'
5b6050f block/pcache: debug build
e04563e block/pcache: add tracepoints
9cf4997 block/pcache: write through
84c80d3 block/pcache: drop used pcache nodes
a6bf468 backup/pcache: pick up parts of the cache
ad38071 block/pcache: inflight readahead request waiting for aio read
fb7ad9f block/pcache: add reading data from the cache
4850ad3 block/pcache: cache invalidation on AIO write requests
093b48c block/pcache: skip readahead for unallocated clusters
27fed87 block/pcache: add AIO readahead
8d1e3b6 block/pcache: updating statistics for overlapping requests
73db1d5 block/pcache: skip large aio read
5531940 block/pcache: statistics collection read requests
1f97985 tests/test-rbcache: add test cases
c523c4b util/rbcache: range-based cache core
ab4ef0c util/rbtree: add rbtree from linux kernel
929e891 block/pcache: empty pcache driver filter
99e7739 block/io: add bdrv_aio_{preadv, pwritev}

=== OUTPUT BEGIN ===
Checking PATCH 1/18: block/io: add bdrv_aio_{preadv, pwritev}...
Checking PATCH 2/18: block/pcache: empty pcache driver filter...
Checking PATCH 3/18: util/rbtree: add rbtree from linux kernel...
Checking PATCH 4/18: util/rbcache: range-based cache core...
Checking PATCH 5/18: tests/test-rbcache: add test cases...
Checking PATCH 6/18: block/pcache: statistics collection read requests...
Checking PATCH 7/18: block/pcache: skip large aio read...
Checking PATCH 8/18: block/pcache: updating statistics for overlapping requests...
Checking PATCH 9/18: block/pcache: add AIO readahead...
Checking PATCH 10/18: block/pcache: skip readahead for unallocated clusters...
Checking PATCH 11/18: block/pcache: cache invalidation on AIO write requests...
Checking PATCH 12/18: block/pcache: add reading data from the cache...
Checking PATCH 13/18: block/pcache: inflight readahead request waiting for aio read...
Checking PATCH 14/18: backup/pcache: pick up parts of the cache...
Checking PATCH 15/18: block/pcache: drop used pcache nodes...
Checking PATCH 16/18: block/pcache: write through...
Checking PATCH 17/18: block/pcache: add tracepoints...
Checking PATCH 18/18: block/pcache: debug build...
ERROR: do not use C99 // comments
#20: FILE: block/pcache.c:19:
+// #define PCACHE_DEBUG

total: 1 errors, 0 warnings, 84 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v1 01/18] block/io: add bdrv_aio_{preadv, pwritev}
  2016-11-15  6:36 ` [Qemu-devel] [PATCH v1 01/18] block/io: add bdrv_aio_{preadv, pwritev} Pavel Butsykin
@ 2016-11-23 14:28   ` Kevin Wolf
  2016-11-24 10:58     ` Pavel Butsykin
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2016-11-23 14:28 UTC (permalink / raw)
  To: Pavel Butsykin
  Cc: qemu-devel, qemu-block, den, famz, stefanha, mreitz, eblake

On 15.11.2016 07:36, Pavel Butsykin wrote:
> It's just byte-based wrappers over bdrv_co_aio_prw_vector(), which provide
>  a byte-based interface for AIO read/write.
> 
> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>

I'm in the process of phasing out the last users of bdrv_aio_*() so that
this set of interfaces can be removed. I'm doing this because it's
unnecessary redundancy: we have too many wrapper functions that expose
the same functionality with different syntax. So let's not add new
users.

At first sight, you don't even seem to use bdrv_aio_preadv() for actual
parallelism, but you often have a pattern like this:

    void foo_cb(void *opaque)
    {
        ...
        qemu_coroutine_enter(acb->co);
    }

    void caller()
    {
        ...
        acb = bdrv_aio_preadv(...);
        qemu_coroutine_yield();
    }

The code will actually become a lot simpler if you use bdrv_co_preadv()
instead because you don't have to have a callback, but you get pure
sequential code.
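
For comparison, the coroutine-based version collapses to plain sequential
code, something like:

    int caller()
    {
        ...
        ret = bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
        /* execution simply continues here when the request completes */
        ...
    }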

The part that actually has some parallelism, pcache_readahead_request(),
already creates its own coroutine, so it runs in the background without
using callback-style interfaces.

Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v1 02/18] block/pcache: empty pcache driver filter
  2016-11-15  6:36 ` [Qemu-devel] [PATCH v1 02/18] block/pcache: empty pcache driver filter Pavel Butsykin
@ 2016-11-23 15:15   ` Kevin Wolf
  2016-11-24 15:48     ` Pavel Butsykin
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2016-11-23 15:15 UTC (permalink / raw)
  To: Pavel Butsykin
  Cc: qemu-devel, qemu-block, den, famz, stefanha, mreitz, eblake

On 15.11.2016 07:36, Pavel Butsykin wrote:
> The basic version of the pcache driver, for easy preparation of the patch set.
> 
> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> ---
>  block/Makefile.objs |   1 +
>  block/pcache.c      | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 145 insertions(+)
>  create mode 100644 block/pcache.c
> 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 7d4031d..c60f680 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -4,6 +4,7 @@ block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
>  block-obj-y += qed-check.o
>  block-obj-y += vhdx.o vhdx-endian.o vhdx-log.o
>  block-obj-y += quorum.o
> +block-obj-y += pcache.o
>  block-obj-y += parallels.o blkdebug.o blkverify.o blkreplay.o
>  block-obj-y += block-backend.o snapshot.o qapi.o
>  block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
> diff --git a/block/pcache.c b/block/pcache.c
> new file mode 100644
> index 0000000..59461df
> --- /dev/null
> +++ b/block/pcache.c
> @@ -0,0 +1,144 @@
> +/*
> + * Prefetch cache driver filter
> + *
> + * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.

"All rights reserved" doesn't really agree with licensing under the GPL.
It would be preferable to remove this note to avoid any ambiguity.

> + *
> + * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "block/block_int.h"
> +#include "qapi/error.h"
> +#include "qapi/qmp/qstring.h"
> +
> +
> +static QemuOptsList runtime_opts = {
> +    .name = "pcache",
> +    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
> +    .desc = {
> +        {
> +            .name = "x-image",
> +            .type = QEMU_OPT_STRING,
> +            .help = "[internal use only, will be removed]",
> +        },

blkdebug/blkverify have this because they have to support legacy syntax
like -drive file=blkdebug:blkdebug.conf:test.img, i.e. they have to deal
with filenames.

Here we don't have to support a legacy syntax, so I would completely
avoid this from the beginning. You already support the "image" option,
which should be good enough.

> +        { /* end of list */ }
> +    },
> +};
> +
> +typedef struct PCacheAIOCB {
> +    Coroutine *co;
> +    int ret;
> +} PCacheAIOCB;
> +
> +static void pcache_aio_cb(void *opaque, int ret)
> +{
> +    PCacheAIOCB *acb = opaque;
> +
> +    acb->ret = ret;
> +    qemu_coroutine_enter(acb->co);
> +}
> +
> +static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
> +                                         uint64_t bytes, QEMUIOVector *qiov,
> +                                         int flags)
> +{
> +    PCacheAIOCB acb = {
> +        .co = qemu_coroutine_self(),
> +    };
> +
> +    bdrv_aio_preadv(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
> +
> +    qemu_coroutine_yield();
> +
> +    return acb.ret;
> +}

As I commented on patch 1, all of this can be replaced by a single line:

    return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);

> +static coroutine_fn int pcache_co_pwritev(BlockDriverState *bs, uint64_t offset,
> +                                          uint64_t bytes, QEMUIOVector *qiov,
> +                                          int flags)
> +{
> +    PCacheAIOCB acb = {
> +        .co = qemu_coroutine_self(),
> +    };
> +
> +    bdrv_aio_pwritev(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
> +
> +    qemu_coroutine_yield();
> +
> +    return acb.ret;
> +}

Same here.

> +static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
> +                            Error **errp)
> +{
> +    QemuOpts *opts;
> +    Error *local_err = NULL;
> +    int ret = 0;
> +
> +    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> +    qemu_opts_absorb_qdict(opts, options, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
> +    assert(bs->file == NULL);
> +    bs->file = bdrv_open_child(qemu_opt_get(opts, "x-image"), options,
> +                               "image", bs, &child_format, false, &local_err);

When removing x-image, the first parameter becomes NULL here.

> +    if (local_err) {
> +        ret = -EINVAL;
> +        error_propagate(errp, local_err);
> +    }
> +fail:
> +    qemu_opts_del(opts);
> +    return ret;
> +}
> +
> +static void pcache_close(BlockDriverState *bs)
> +{
> +}
> +
> +static void pcache_parse_filename(const char *filename, QDict *options,
> +                                  Error **errp)
> +{
> +    qdict_put(options, "x-image", qstring_from_str(filename));
> +}

This one goes away.

> +static int64_t pcache_getlength(BlockDriverState *bs)
> +{
> +    return bdrv_getlength(bs->file->bs);
> +}
> +
> +static bool pcache_recurse_is_first_non_filter(BlockDriverState *bs,
> +                                               BlockDriverState *candidate)
> +{
> +    return bdrv_recurse_is_first_non_filter(bs->file->bs, candidate);
> +}
> +
> +static BlockDriver bdrv_pcache = {
> +    .format_name                        = "pcache",
> +    .protocol_name                      = "pcache",

Remove this one, it would only be used for something like
"pcache:<filename>" syntax.

> +    .instance_size                      = 0,
> +
> +    .bdrv_parse_filename                = pcache_parse_filename,
> +    .bdrv_file_open                     = pcache_file_open,
> +    .bdrv_close                         = pcache_close,
> +    .bdrv_getlength                     = pcache_getlength,
> +
> +    .bdrv_co_preadv                     = pcache_co_preadv,
> +    .bdrv_co_pwritev                    = pcache_co_pwritev,
> +
> +    .is_filter                          = true,
> +    .bdrv_recurse_is_first_non_filter   = pcache_recurse_is_first_non_filter,
> +};
> +
> +static void bdrv_cache_init(void)
> +{
> +    bdrv_register(&bdrv_pcache);
> +}
> +
> +block_init(bdrv_cache_init);

You should also add pcache to the QAPI schema, so that blockdev-add can
create instances of it.

Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v1 04/18] util/rbcache: range-based cache core
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 04/18] util/rbcache: range-based cache core Pavel Butsykin
@ 2016-11-23 21:25   ` Kevin Wolf
  2016-11-24 19:23     ` Pavel Butsykin
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2016-11-23 21:25 UTC (permalink / raw)
  To: Pavel Butsykin
  Cc: qemu-devel, qemu-block, den, famz, stefanha, mreitz, eblake

On 15.11.2016 07:37, Pavel Butsykin wrote:
> RBCache provides functionality to cache the data from block devices
> (basically). The range here is used as the main key for searching and storing
> data. The cache is based on red-black trees, so basic operations search,
> insert, delete are performed for O(log n).
> 
> It is important to note that QEMU usually does not require a data cache, but
> in reality, there are already some cases where a cache of small amounts can
> increase performance, so as the data structure was selected red-black trees,
> this is a fairly simple data structure and show high efficiency on a small
> number of elements. Therefore, when the minimum range is 512 bytes, the
> recommended size of the cache memory no more than 8-16mb.  Also note
> that this cache implementation allows to store ranges of different lengths
> without alignment.
> 
> Generic cache core can easily be used to implement different caching policies at
> the block level, such as read-ahed. Also it can be used in some special cases,
> for example for caching data in qcow2 when sequential allocating writes to image
> with backing file.
> 
> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> ---
>  MAINTAINERS            |   6 ++
>  include/qemu/rbcache.h | 105 +++++++++++++++++++++
>  util/Makefile.objs     |   1 +
>  util/rbcache.c         | 246 +++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 358 insertions(+)
>  create mode 100644 include/qemu/rbcache.h
>  create mode 100644 util/rbcache.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ddf797b..cb74802 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1365,6 +1365,12 @@ F: include/qemu/rbtree.h
>  F: include/qemu/rbtree_augmented.h
>  F: util/rbtree.c
>  
> +Range-Based Cache
> +M: Denis V. Lunev <den@openvz.org>
> +S: Supported
> +F: include/qemu/rbcache.h
> +F: util/rbcache.c
> +
>  UUID
>  M: Fam Zheng <famz@redhat.com>
>  S: Supported
> diff --git a/include/qemu/rbcache.h b/include/qemu/rbcache.h
> new file mode 100644
> index 0000000..c8f0a9f
> --- /dev/null
> +++ b/include/qemu/rbcache.h
> @@ -0,0 +1,105 @@
> +/*
> + * QEMU Range-Based Cache core
> + *
> + * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
> + *
> + * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef RBCACHE_H
> +#define RBCACHE_H
> +
> +#include "qemu/rbtree.h"
> +#include "qemu/queue.h"
> +
> +typedef struct RBCacheNode {
> +    struct RbNode rb_node;
> +    uint64_t offset;
> +    uint64_t bytes;
> +    QTAILQ_ENTRY(RBCacheNode) entry;
> +} RBCacheNode;
> +
> +typedef struct RBCache RBCache;
> +
> +typedef RBCacheNode *RBNodeAlloc(uint64_t offset, uint64_t bytes, void *opaque);
> +typedef void RBNodeFree(RBCacheNode *node, void *opaque);

Maybe worth comments describing what these functions do apart from
g_new()/g_free()? I assume that offset and bytes must be initialised
from the parameters. Should rb_node and entry be zeroed?

> +
> +enum eviction_type {
> +    RBCACHE_FIFO,
> +    RBCACHE_LRU,
> +};
> +
> +/**
> + * rbcache_search:
> + * @rbcache: the cache object.
> + * @offset: the start of the range.
> + * @bytes: the size of the range.
> + *
> + * Returns the node corresponding to the range(offset, bytes),
> + * or NULL if the node was not found.
> + */
> +void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes);

What if the range covers multiple nodes? Is it defined which of the
nodes we return or do you just get any?

Why does this function (and the following ones) return void* rather than
RBCacheNode* if they are supposed to return a node?

> +/**
> + * rbcache_insert:
> + * @rbcache: the cache object.
> + * @node: a new node for the cache.
> + *
> + * Returns the new node, or old node if the node already exists.
> + */
> +void *rbcache_insert(RBCache *rbcache, RBCacheNode *node);

What does "if the node already exists" mean? If @node (the very same
object) is already stored in the cache object, or if a node describing
the same range already exists?

> +/**
> + * rbcache_search_and_insert:
> + * @rbcache: the cache object.
> + * @offset: the start of the range.
> + * @bytes: the size of the range.
> + *
> + * rbcache_search_and_insert() is like rbcache_insert(), except that a new node
> + * is allocated inside the function. Returns the new node, or old node if the
> + * node already exists.
> + */
> +void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
> +                                uint64_t byte);

What happens if a node exists, but only for part of the range?

> +/**
> + * rbcache_remove:
> + * @rbcache: the cache object.
> + * @node: the node to remove.
> + *
> + * Removes the cached range owned by the node.
> + */
> +void rbcache_remove(RBCache *rbcache, RBCacheNode *node);
> +
> +RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
> +                                uint64_t bytes);
> +void rbcache_node_free(RBCache *rbcache, RBCacheNode *node);

No comment for these two?

> +
> +/**
> + * rbcache_create:
> + * @alloc: callback to allocation node, allows to upgrade allocate and expand
> + *         the capabilities of the node.
> + * @free: callback to release node, must be used together with alloc callback.
> + * @limit_size: maximum cache size in bytes.
> + * @eviction_type: method of memory limitation
> + * @opaque: the opaque pointer to pass to the callback.
> + *
> + * Returns the cache object.
> + */
> +RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
> +                        uint64_t limit_size, int eviction_type, void *opaque);
> +
> +/**
> + * rbcache_destroy:
> + * @rbcache: the cache object.
> + *
> + * Cleanup the cache object created with rbcache_create().
> + */
> +void rbcache_destroy(RBCache *rbcache);
> +
> +#endif /* RBCACHE_H */
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index 727a567..9fb0de6 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -38,3 +38,4 @@ util-obj-y += qdist.o
>  util-obj-y += qht.o
>  util-obj-y += range.o
>  util-obj-y += rbtree.o
> +util-obj-y += rbcache.o
> diff --git a/util/rbcache.c b/util/rbcache.c
> new file mode 100644
> index 0000000..05cfa5a
> --- /dev/null
> +++ b/util/rbcache.c
> @@ -0,0 +1,246 @@
> +/*
> + * QEMU Range-Based Cache core
> + *
> + * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
> + *
> + * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/rbcache.h"
> +
> +/* RBCache provides functionality to cache the data from block devices
> + * (basically). The range here is used as the main key for searching and storing
> + * data. The cache is based on red-black trees, so basic operations search,
> + * insert, delete are performed for O(log n).
> + *
> + * It is important to note that QEMU usually does not require a data cache, but
> + * in reality, there are already some cases where a cache of small amounts can
> + * increase performance, so as the data structure was selected red-black trees,
> + * this is a quite simple data structure and show high efficiency on a small
> + * number of elements. Therefore, when the minimum range is 512 bytes, the
> + * recommended size of the cache memory no more than 8-16mb. Also note that this
> + * cache implementation allows to store ranges of different lengths without
> + * alignment.
> + */
> +
> +struct RBCache {
> +    struct RbRoot root;
> +    RBNodeAlloc *alloc;
> +    RBNodeFree  *free;
> +    uint64_t limit_size;
> +    uint64_t cur_size;
> +    int eviction_type;

s/int/enum eviction_type/

> +    void *opaque;
> +
> +    QTAILQ_HEAD(RBCacheNodeHead, RBCacheNode) queue;
> +};
> +
> +static int node_key_cmp(const RBCacheNode *node1, const RBCacheNode *node2)
> +{
> +    assert(node1 != NULL);
> +    assert(node2 != NULL);
> +
> +    if (node1->offset >= node2->offset + node2->bytes) {
> +        return 1;
> +    }
> +    if (node1->offset + node1->bytes <= node2->offset) {
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static RBCacheNode *node_previous(RBCacheNode* node, uint64_t target_offset)

It would be nice to describe this function in a comment.

From what I understand, it goes backwards in the tree until the last
node whose range end is larger than target_offset is reached.

> +{
> +    while (node) {
> +        struct RbNode *prev_rb_node = rb_prev(&node->rb_node);
> +        RBCacheNode *prev_node;
> +        if (prev_rb_node == NULL) {
> +            break;
> +        }
> +        prev_node = container_of(prev_rb_node, RBCacheNode, rb_node);
> +        if (prev_node->offset + prev_node->bytes <= target_offset) {
> +            break;
> +        }
> +        node = prev_node;
> +    }
> +
> +    assert(node != NULL);
> +
> +    return node;
> +}
> +
> +RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
> +                                uint64_t bytes)
> +{
> +    RBCacheNode *node;
> +
> +    if (rbcache->alloc) {
> +        node = rbcache->alloc(offset, bytes, rbcache->opaque);
> +    } else {
> +        node = g_malloc(sizeof(*node));

g_new() is preferred.

> +    }
> +
> +    node->offset = offset;
> +    node->bytes  = bytes;

Okay, so the callback doesn't have to set offset/bytes because it's
already set here, and it can be NULL. Both things to be mentioned in the
documentation.

> +    return node;
> +}
> +
> +void rbcache_node_free(RBCache *rbcache, RBCacheNode *node)
> +{
> +    if (rbcache->free) {
> +        rbcache->free(node, rbcache->opaque);
> +    } else {
> +        g_free(node);
> +    }
> +}
> +
> +static void rbcache_try_shrink(RBCache *rbcache)
> +{
> +    while (rbcache->cur_size > rbcache->limit_size) {
> +        RBCacheNode *node;
> +        assert(!QTAILQ_EMPTY(&rbcache->queue));
> +
> +        node = QTAILQ_LAST(&rbcache->queue, RBCacheNodeHead);
> +
> +        rbcache_remove(rbcache, node);
> +    }
> +}
> +
> +static inline void node_move_in_queue(RBCache *rbcache, RBCacheNode *node)
> +{
> +    if (rbcache->eviction_type == RBCACHE_LRU) {
> +        QTAILQ_REMOVE(&rbcache->queue, node, entry);
> +        QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
> +    }
> +}
> +
> +static RBCacheNode *node_insert(RBCache *rbcache, RBCacheNode *node, bool alloc)

Here as well there seem to be a few different cases and it would be good
to have a comment that specifies the behaviour for each of them. It
would also give me something to review against.

> +{
> +    struct RbNode **new, *parent = NULL;
> +
> +    assert(rbcache != NULL);
> +    assert(node->bytes != 0);
> +
> +    /* Figure out where to put new node */
> +    new = &(rbcache->root.rb_node);
> +    while (*new) {
> +        RBCacheNode *this = container_of(*new, RBCacheNode, rb_node);
> +        int result = node_key_cmp(node, this);
> +        if (result == 0) {
> +            this = node_previous(this, node->offset);

So in case of partial overlaps, the existing overlapping node with the
lowest offset is returned, but the range is not adjusted. Correct?

> +            node_move_in_queue(rbcache, this);
> +            return this;
> +        }
> +        parent = *new;
> +        new = result < 0 ? &((*new)->rb_left) : &((*new)->rb_right);
> +    }
> +
> +    if (alloc) {
> +        node = rbcache_node_alloc(rbcache, node->offset, node->bytes);
> +    }
> +    /* Add new node and rebalance tree. */
> +    rb_link_node(&node->rb_node, parent, new);
> +    rb_insert_color(&node->rb_node, &rbcache->root);
> +
> +    rbcache->cur_size += node->bytes;
> +
> +    rbcache_try_shrink(rbcache);
> +
> +    QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
> +
> +    return node;
> +}
> +
> +void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes)
> +{
> +    struct RbNode *rb_node;
> +    RBCacheNode node = {
> +        .offset = offset,
> +        .bytes  = bytes,
> +    };
> +
> +    assert(rbcache != NULL);
> +
> +    rb_node = rbcache->root.rb_node;
> +    while (rb_node) {
> +        RBCacheNode *this = container_of(rb_node, RBCacheNode, rb_node);
> +        int32_t result = node_key_cmp(&node, this);
> +        if (result == 0) {
> +            this = node_previous(this, offset);
> +            node_move_in_queue(rbcache, this);
> +            return this;
> +        }
> +        rb_node = result < 0 ? rb_node->rb_left : rb_node->rb_right;
> +    }
> +    return NULL;
> +}
> +
> +void *rbcache_insert(RBCache *rbcache, RBCacheNode *node)
> +{
> +    return node_insert(rbcache, node, false);
> +}
> +
> +void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
> +                                uint64_t bytes)
> +{
> +    RBCacheNode node = {
> +        .offset = offset,
> +        .bytes  = bytes,
> +    };
> +
> +    return node_insert(rbcache, &node, true);
> +}
> +
> +void rbcache_remove(RBCache *rbcache, RBCacheNode *node)
> +{
> +    assert(rbcache->cur_size >= node->bytes);
> +
> +    rbcache->cur_size -= node->bytes;
> +    rb_erase(&node->rb_node, &rbcache->root);
> +
> +    QTAILQ_REMOVE(&rbcache->queue, node, entry);
> +
> +    rbcache_node_free(rbcache, node);
> +}
> +
> +RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
> +                        uint64_t limit_size, int eviction_type, void *opaque)
> +{
> +    RBCache *rbcache = g_malloc(sizeof(*rbcache));

Again, g_new() is preferred.

> +
> +    /* We can't use only one callback, or both or neither */
> +    assert(!(!alloc ^ !free));
> +
> +    rbcache->root   = RB_ROOT;
> +    rbcache->alloc  = alloc;
> +    rbcache->free   = free;
> +    rbcache->opaque = opaque;
> +    rbcache->limit_size = limit_size;
> +    rbcache->cur_size   = 0;
> +    rbcache->eviction_type = eviction_type;
> +
> +    QTAILQ_INIT(&rbcache->queue);

And actually, it would be good to make sure that all fields are zeroed
even if the struct is extended. You can do that with g_new0(), or -
that's what I would do - write the initialisation like this:

    *rbcache = (RBCache) {
        .root       = RB_ROOT,
        .alloc      = alloc,
        ...
    };

If you align all values on a column (which is appreciated), please do so
consistently and align everything to the same column as the longest
field name (eviction_type).

> +
> +    return rbcache;
> +}
> +
> +void rbcache_destroy(RBCache *rbcache)
> +{
> +    RBCacheNode *node, *next;
> +
> +    assert(rbcache != NULL);
> +
> +    QTAILQ_FOREACH_SAFE(node, &rbcache->queue, entry, next) {
> +        QTAILQ_REMOVE(&rbcache->queue, node, entry);
> +        rbcache_node_free(rbcache, node);
> +    }
> +
> +    g_free(rbcache);
> +}

Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v1 01/18] block/io: add bdrv_aio_{preadv, pwritev}
  2016-11-23 14:28   ` Kevin Wolf
@ 2016-11-24 10:58     ` Pavel Butsykin
  2016-11-24 12:36       ` Kevin Wolf
  0 siblings, 1 reply; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-24 10:58 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, den, famz, stefanha, mreitz, eblake

On 23.11.2016 17:28, Kevin Wolf wrote:
> On 15.11.2016 07:36, Pavel Butsykin wrote:
>> It's just byte-based wrappers over bdrv_co_aio_prw_vector(), which provide
>>   a byte-based interface for AIO read/write.
>>
>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>
> I'm in the process of phasing out the last users of bdrv_aio_*() so that
> this set of interfaces can be removed. I'm doing this because it's
> unnecessary redundancy: we have too many wrapper functions that expose
> the same functionality with different syntax. So let's not add new
> users.
>
> At first sight, you don't even seem to use bdrv_aio_preadv() for actual
> parallelism, but you often have a pattern like this:
>
>      void foo_cb(void *opaque)
>      {
>          ...
>          qemu_coroutine_enter(acb->co);
>      }
>
>      void caller()
>      {
>          ...
>          acb = bdrv_aio_preadv(...);
>          qemu_coroutine_yield();
>      }
>
> The code will actually become a lot simpler if you use bdrv_co_preadv()
> instead because you don't have to have a callback, but you get pure
> sequential code.
>
> The part that actually has some parallelism, pcache_readahead_request(),
> already creates its own coroutine, so it runs in the background without
> using callback-style interfaces.

I used bdrv_aio_preadv() because it conveniently solves the partial
cache hit. To handle a partial cache hit, we need to split a request
into smaller parts, issue asynchronous requests and wait for all of
them in one place.

Do you propose to create a coroutine for each part of the request? It
seemed to me that bdrv_aio_preadv() is a wrapper that lets us avoid
duplicating that code.
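
For reference, the pattern the series uses is a shared completion counter
on the ACB, roughly:

    acb.ref = 1;                  /* held by the caller itself */
    ...
    acb->ref++;                   /* one per in-flight part */
    ...
    if (--acb->ref == 0) {        /* last completion wakes the coroutine */
        qemu_coroutine_enter(acb->co);
    }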

> Kevin
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v1 05/18] tests/test-rbcache: add test cases
  2016-11-15  6:37 ` [Qemu-devel] [PATCH v1 05/18] tests/test-rbcache: add test cases Pavel Butsykin
@ 2016-11-24 12:20   ` Kevin Wolf
  2016-11-25  9:58     ` Pavel Butsykin
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2016-11-24 12:20 UTC (permalink / raw)
  To: Pavel Butsykin
  Cc: qemu-devel, qemu-block, den, famz, stefanha, mreitz, eblake

On 15.11.2016 07:37, Pavel Butsykin wrote:
> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> ---
>  tests/Makefile.include |   3 +
>  tests/test-rbcache.c   | 308 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 311 insertions(+)
>  create mode 100644 tests/test-rbcache.c
> 
> diff --git a/tests/Makefile.include b/tests/Makefile.include
> index 8162f6f..36bb472 100644
> --- a/tests/Makefile.include
> +++ b/tests/Makefile.include
> @@ -54,6 +54,8 @@ check-unit-y += tests/test-hbitmap$(EXESUF)
>  gcov-files-test-hbitmap-y = blockjob.c
>  check-unit-y += tests/test-blockjob$(EXESUF)
>  check-unit-y += tests/test-blockjob-txn$(EXESUF)
> +gcov-files-test-rbcache-y = util/rbcache.c
> +check-unit-y += tests/test-rbcache$(EXESUF)
>  check-unit-y += tests/test-x86-cpuid$(EXESUF)
>  # all code tested by test-x86-cpuid is inside topology.h
>  gcov-files-test-x86-cpuid-y =
> @@ -481,6 +483,7 @@ tests/test-blockjob-txn$(EXESUF): tests/test-blockjob-txn.o $(test-block-obj-y)
>  tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
>  tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
>  tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y)
> +tests/test-rbcache$(EXESUF): tests/test-rbcache.o $(test-util-obj-y)
>  tests/test-x86-cpuid$(EXESUF): tests/test-x86-cpuid.o
>  tests/test-xbzrle$(EXESUF): tests/test-xbzrle.o migration/xbzrle.o page_cache.o $(test-util-obj-y)
>  tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o
> diff --git a/tests/test-rbcache.c b/tests/test-rbcache.c
> new file mode 100644
> index 0000000..1c72bfa
> --- /dev/null
> +++ b/tests/test-rbcache.c
> @@ -0,0 +1,308 @@
> +/*
> + * QEMU Range-Based Cache core
> + *
> + * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
> + *
> + * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/rbcache.h"
> +
> +typedef struct TestRBCacheData {
> +    RBCache *cache;
> +} TestRBCacheData;
> +
> +typedef struct TestRBCacheConfig {
> +    uint64_t limit_size;
> +    int eviction_type;
> +    RBNodeAlloc *alloc;
> +    RBNodeFree  *free;
> +    void *opaque;
> +} TestRBCacheConfig;
> +
> +#define KB(_n) ((_n) << 10)
> +#define MB(_n) ((_n) << 20)
> +
> +#define OFFSET1 0
> +#define SIZE1   KB(1)
> +
> +#define OFFSET2 KB(1)
> +#define SIZE2   KB(2)
> +
> +#define OFFSET3 KB(15)
> +#define SIZE3   KB(1)
> +
> +#define OFFSET4 KB(7)
> +#define SIZE4   KB(7)
> +
> +#define OFFSET5 KB(2)
> +#define SIZE5   KB(8)

Visualised, we test these requests:

1: *
2:  **
3:                *
4:        *******
5:   ********

You test inserting the only element, inserting after the last element,
inserting in the middle and inserting something that overlaps two other
requests at its start and end.

That's a good start, but it might be worth testing more scenarios:

- Inserting a new first element to a non-empty cache
- Overlapping only at the start (see the sketch after this list)
- Overlapping only at the end
- Overlapping in the middle (i.e. including existing ranges as a
  subset)
    * With only one node
    * With multiple nodes (like adding offset=2, size=16kb here)
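
The "overlapping only at the start" case could look something like this,
reusing the macros above (a sketch; assumes the insert returns the existing
overlapping node, as discussed in the review of patch 4):

    /* request overlaps only the start of node 3 */
    node = rbcache_search_and_insert(data->cache, OFFSET3 - KB(1), KB(2));
    g_assert_nonnull(node);
    g_assert_cmpuint(node->offset, ==, OFFSET3);
    g_assert_cmpuint(node->bytes, ==, SIZE3);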


> +
> +static void test_rbcache_init(TestRBCacheData *data, const void *ctx)
> +{
> +    g_assert_nonnull(data->cache);
> +}
> +
> +static void test_rbcache_insert(TestRBCacheData *data, const void *ctx)
> +{
> +    RBCacheNode *node1 = rbcache_node_alloc(data->cache, OFFSET1, SIZE1);
> +    RBCacheNode *node2 = rbcache_node_alloc(data->cache, OFFSET2, SIZE2);
> +    RBCacheNode *node3 = rbcache_node_alloc(data->cache, OFFSET3, SIZE3);
> +    RBCacheNode *node4 = rbcache_node_alloc(data->cache, OFFSET4, SIZE4);
> +    RBCacheNode *node5 = rbcache_node_alloc(data->cache, OFFSET5, SIZE5);
> +    RBCacheNode *node;
> +
> +    node = rbcache_insert(data->cache, node1);
> +    g_assert_true(node == node1);
> +
> +    node = rbcache_insert(data->cache, node2);
> +    g_assert_true(node == node2);
> +
> +    node = rbcache_insert(data->cache, node3);
> +    g_assert_true(node == node3);
> +
> +    node = rbcache_insert(data->cache, node4);
> +    g_assert_true(node == node4);
> +
> +    node = rbcache_insert(data->cache, node5);
> +    g_assert_true(node != node5);

You can test for the right result instead:

    g_assert_true(node == node2);


> +    rbcache_node_free(data->cache, node5);
> +}
> +
> +static void test_rbcache_search(TestRBCacheData *data, const void *ctx)
> +{
> +    RBCacheNode *node;
> +
> +    test_rbcache_insert(data, ctx);
> +
> +    node = rbcache_search(data->cache, OFFSET1, SIZE1);
> +    g_assert_nonnull(node);
> +    g_assert_cmpuint(node->offset, ==, OFFSET1);
> +    g_assert_cmpuint(node->bytes, ==, SIZE1);
> +
> +    node = rbcache_search(data->cache, OFFSET2 + KB(1), SIZE1);

Intentional that you use SIZE1 here even though we're looking at node 2?

> +    g_assert_nonnull(node);
> +    g_assert_cmpuint(node->offset, ==, OFFSET2);
> +    g_assert_cmpuint(node->bytes, ==, SIZE2);
> +
> +    node = rbcache_search(data->cache, OFFSET5, SIZE5);
> +    g_assert_nonnull(node);
> +    g_assert_cmpuint(node->offset, ==, OFFSET2);
> +    g_assert_cmpuint(node->bytes, ==, SIZE2);
> +
> +    node = rbcache_search(data->cache, OFFSET5 + KB(2), SIZE5);
> +    g_assert_nonnull(node);
> +    g_assert_cmpuint(node->offset, ==, OFFSET4);
> +    g_assert_cmpuint(node->bytes, ==, SIZE4);
> +
> +    node = rbcache_search(data->cache, OFFSET3 + SIZE3, SIZE3);
> +    g_assert_null(node);
> +}
> +
> +static void test_rbcache_search_and_insert(TestRBCacheData *data,
> +                                           const void *ctx)
> +{
> +    RBCacheNode *node;
> +
> +    node = rbcache_search_and_insert(data->cache, OFFSET1, SIZE1);
> +    g_assert_nonnull(node);
> +    g_assert_cmpuint(node->offset, ==, OFFSET1);
> +    g_assert_cmpuint(node->bytes, ==, SIZE1);
> +
> +    node = rbcache_search_and_insert(data->cache, OFFSET2, SIZE2);
> +    g_assert_nonnull(node);
> +    g_assert_cmpuint(node->offset, ==, OFFSET2);
> +    g_assert_nonnull(node);

I think you wanted to check node->bytes here instead of duplicating the
nonnull check.

> +    node = rbcache_search_and_insert(data->cache, OFFSET3, SIZE3);
> +    g_assert_nonnull(node);
> +    g_assert_cmpuint(node->offset, ==, OFFSET3);
> +    g_assert_cmpuint(node->bytes, ==, SIZE3);
> +
> +    node = rbcache_search_and_insert(data->cache, OFFSET4, SIZE4);
> +    g_assert_nonnull(node);
> +    g_assert_cmpuint(node->offset, ==, OFFSET4);
> +    g_assert_cmpuint(node->bytes, ==, SIZE4);
> +
> +    node = rbcache_search_and_insert(data->cache, OFFSET5, SIZE5);
> +    g_assert_nonnull(node);
> +    g_assert_cmpuint(node->offset, ==, OFFSET2);
> +    g_assert_cmpuint(node->bytes, ==, SIZE2);
> +}
> +
> +static void test_rbcache_remove(TestRBCacheData *data, const void *ctx)
> +{
> +    RBCacheNode *node;
> +
> +    test_rbcache_search_and_insert(data, ctx);
> +
> +    node = rbcache_search(data->cache, OFFSET1, SIZE1);
> +    g_assert_nonnull(node);
> +    rbcache_remove(data->cache, node);
> +    node = rbcache_search(data->cache, OFFSET1, SIZE1);
> +    g_assert_null(node);
> +
> +    node = rbcache_search(data->cache, OFFSET3, SIZE3);
> +    g_assert_nonnull(node);
> +    rbcache_remove(data->cache, node);
> +    node = rbcache_search(data->cache, OFFSET3, SIZE3);
> +    g_assert_null(node);
> +
> +    node = rbcache_search(data->cache, OFFSET4, SIZE4);
> +    g_assert_nonnull(node);
> +    rbcache_remove(data->cache, node);
> +    node = rbcache_search(data->cache, OFFSET4, SIZE4);
> +    g_assert_null(node);
> +
> +    node = rbcache_search(data->cache, OFFSET2, SIZE2);
> +    g_assert_nonnull(node);
> +    rbcache_remove(data->cache, node);
> +    node = rbcache_search(data->cache, OFFSET2, SIZE2);
> +    g_assert_null(node);
> +}
> +
> +static void test_rbcache_shrink(TestRBCacheData *data, const void *ctx)
> +{
> +    RBCacheNode *node;
> +
> +    node = rbcache_search_and_insert(data->cache, 0, MB(2));
> +    g_assert_nonnull(node);
> +
> +    node = rbcache_search_and_insert(data->cache, MB(2), MB(3));
> +    g_assert_nonnull(node);
> +
> +    node = rbcache_search(data->cache, 0, MB(2));
> +    g_assert_null(node);
> +
> +    node = rbcache_search(data->cache, MB(2), MB(3));
> +    g_assert_nonnull(node);
> +
> +    node = rbcache_search_and_insert(data->cache, 0, MB(2));
> +    g_assert_nonnull(node);
> +
> +    node = rbcache_search(data->cache, 0, MB(2));
> +    g_assert_nonnull(node);
> +
> +    node = rbcache_search(data->cache, MB(2), MB(3));
> +    g_assert_null(node);
> +}
> +
> +static void test_rbcache_shrink_fifo(TestRBCacheData *data, const void *ctx)
> +{
> +    RBCacheNode *node;
> +
> +    rbcache_search_and_insert(data->cache, 0, MB(1));
> +    rbcache_search_and_insert(data->cache, MB(1), MB(1));
> +    rbcache_search_and_insert(data->cache, MB(2), MB(1));
> +    rbcache_search_and_insert(data->cache, MB(3), MB(1));
> +
> +    node = rbcache_search_and_insert(data->cache, MB(4), MB(1));
> +    g_assert_nonnull(node);
> +    node = rbcache_search(data->cache, 0, MB(1));
> +    g_assert_null(node);
> +
> +    node = rbcache_search_and_insert(data->cache, MB(5), MB(1));
> +    g_assert_nonnull(node);
> +    node = rbcache_search(data->cache, MB(1), MB(1));
> +    g_assert_null(node);
> +
> +    node = rbcache_search_and_insert(data->cache, MB(6), MB(1));
> +    g_assert_nonnull(node);
> +    node = rbcache_search(data->cache, MB(2), MB(1));
> +    g_assert_null(node);
> +
> +    node = rbcache_search_and_insert(data->cache, MB(7), MB(1));
> +    g_assert_nonnull(node);
> +    node = rbcache_search(data->cache, MB(3), MB(1));
> +    g_assert_null(node);

This test doesn't really show that this is any different from LRU mode.
LRU would behave exactly the same because you have no accesses to
existing nodes.

It also doesn't test that the other nodes still exist.
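
Something like this after the four initial insertions would make the
difference visible (an untested sketch, same config as in this test):

    /* touch the oldest node; in LRU mode this would protect it
     * from eviction, in FIFO mode it must not matter */
    node = rbcache_search(data->cache, 0, MB(1));
    g_assert_nonnull(node);

    node = rbcache_search_and_insert(data->cache, MB(4), MB(1));
    g_assert_nonnull(node);

    /* FIFO still evicts the oldest insertion despite the access... */
    node = rbcache_search(data->cache, 0, MB(1));
    g_assert_null(node);

    /* ...and the younger nodes survive */
    node = rbcache_search(data->cache, MB(1), MB(1));
    g_assert_nonnull(node);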

> +}
> +
> +static void test_rbcache_shrink_lru(TestRBCacheData *data, const void *ctx)
> +{
> +    RBCacheNode *node;
> +
> +    rbcache_search_and_insert(data->cache, 0, MB(1));
> +    rbcache_search_and_insert(data->cache, MB(1), MB(1));
> +    rbcache_search_and_insert(data->cache, MB(2), MB(1));
> +    rbcache_search_and_insert(data->cache, MB(3), MB(1));
> +
> +    node = rbcache_search(data->cache, MB(2), MB(1));
> +    g_assert_nonnull(node);
> +
> +    node = rbcache_search(data->cache, MB(1), MB(1));
> +    g_assert_nonnull(node);
> +
> +    node = rbcache_search_and_insert(data->cache, MB(4), MB(1));
> +    g_assert_nonnull(node);
> +    node = rbcache_search(data->cache, 0, MB(1));
> +    g_assert_null(node);
> +
> +    node = rbcache_search_and_insert(data->cache, MB(5), MB(1));
> +    g_assert_nonnull(node);
> +    node = rbcache_search(data->cache, MB(3), MB(1));
> +    g_assert_null(node);
> +
> +    node = rbcache_search_and_insert(data->cache, MB(6), MB(1));
> +    g_assert_nonnull(node);
> +    node = rbcache_search(data->cache, MB(2), MB(1));
> +    g_assert_null(node);
> +
> +    node = rbcache_search_and_insert(data->cache, MB(7), MB(1));
> +    g_assert_nonnull(node);
> +    node = rbcache_search(data->cache, MB(1), MB(1));
> +    g_assert_null(node);

Here as well we don't test whether the other nodes still exist, but at
least we have rbcache_search() calls to make the difference clear. You
should probably copy those calls to the FIFO test to show that they
don't have any effect on the order.

> +}
> +
> +static void test_rbcache_setup(TestRBCacheData *data, const void *ctx)
> +{
> +    const TestRBCacheConfig *config = ctx;
> +
> +    data->cache =
> +        rbcache_create(config->alloc, config->free, config->limit_size,
> +                       config->eviction_type, config->opaque);
> +}
> +
> +static void test_rbcache_teardown(TestRBCacheData *data, const void *ctx)
> +{
> +    rbcache_destroy(data->cache);
> +}
> +
> +static void rbcache_test_add(const char *testpath,
> +                             void (*test_func)(TestRBCacheData *data,
> +                             const void *user_data), void *ctx)
> +{
> +    g_test_add(testpath, TestRBCacheData, ctx, test_rbcache_setup, test_func,
> +               test_rbcache_teardown);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +    TestRBCacheConfig config = {
> +        .limit_size = MB(4),
> +        .eviction_type = RBCACHE_FIFO,
> +    };
> +
> +    g_test_init(&argc, &argv, NULL);
> +
> +    rbcache_test_add("/rbcache/init", test_rbcache_init, &config);
> +    rbcache_test_add("/rbcache/insert", test_rbcache_insert, &config);
> +    rbcache_test_add("/rbcache/search", test_rbcache_search, &config);
> +    rbcache_test_add("/rbcache/search_and_insert",
> +                     test_rbcache_search_and_insert, &config);
> +    rbcache_test_add("/rbcache/rbcache_remove", test_rbcache_remove, &config);
> +    rbcache_test_add("/rbcache/shrink", test_rbcache_shrink, &config);
> +    rbcache_test_add("/rbcache/shrink/fifo", test_rbcache_shrink_fifo, &config);
> +
> +    config.eviction_type = RBCACHE_LRU;
> +    rbcache_test_add("/rbcache/shrink/lru", test_rbcache_shrink_lru, &config);
> +
> +    g_test_run();
> +
> +    return 0;
> +}

Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v1 01/18] block/io: add bdrv_aio_{preadv, pwritev}
  2016-11-24 10:58     ` Pavel Butsykin
@ 2016-11-24 12:36       ` Kevin Wolf
  2016-11-24 15:10         ` Pavel Butsykin
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2016-11-24 12:36 UTC (permalink / raw)
  To: Pavel Butsykin
  Cc: qemu-devel, qemu-block, den, famz, stefanha, mreitz, eblake

Am 24.11.2016 um 11:58 hat Pavel Butsykin geschrieben:
> On 23.11.2016 17:28, Kevin Wolf wrote:
> >Am 15.11.2016 um 07:36 hat Pavel Butsykin geschrieben:
> >>It's just byte-based wrappers over bdrv_co_aio_prw_vector(), which provide
> >>  a byte-based interface for AIO read/write.
> >>
> >>Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> >
> >I'm in the process of phasing out the last users of bdrv_aio_*() so that
> >this set of interfaces can be removed. I'm doing this because it's an
> >unnecessary redundancy, we have too many wrapper functions that expose
> >the same functionality with different syntax. So let's not add new
> >users.
> >
> >At first sight, you don't even seem to use bdrv_aio_preadv() for actual
> >parallelism, but you often have a pattern like this:
> >
> >     void foo_cb(void *opaque)
> >     {
> >         ...
> >         qemu_coroutine_enter(acb->co);
> >     }
> >
> >     void caller()
> >     {
> >         ...
> >         acb = bdrv_aio_preadv(...);
> >         qemu_coroutine_yield();
> >     }
> >
> >The code will actually become a lot simpler if you use bdrv_co_preadv()
> >instead because you don't have to have a callback, but you get pure
> >sequential code.
> >
> >The part that actually has some parallelism, pcache_readahead_request(),
> >already creates its own coroutine, so it runs in the background without
> >using callback-style interfaces.
> 
> I used bdrv_co_preadv() because it conveniently solves the partial
> cache hit. To solve the partial cache hit, we need to split a request
> into smaller parts, issue asynchronous requests and wait for all of
> them in one place.
> 
> Do you propose to create a coroutine for each part of the request? It
> seemed to me that bdrv_co_preadv() is a wrapper that allows us to
> avoid duplicating that code.

It's actually the other way round: bdrv_co_preadv() is the "native"
block layer API, and bdrv_aio_*() are wrappers providing an alternative
interface.


I'm looking at pcache_co_readahead(), for example. It looks like this:

    bdrv_aio_preadv(bs->file, node->common.offset, &readahead_acb.qiov,
                    node->common.bytes, pcache_aio_readahead_cb,
                    &readahead_acb);
    qemu_coroutine_yield();

And then we have pcache_aio_readahead_cb(), which ends in:

    qemu_coroutine_enter(acb->co);

So here the callback style doesn't buy you anything, it just rips the
code apart into two functions. There is no parallelism here anyway:
pcache_co_readahead() doesn't do anything until the callback reenters
it. This is a very obvious example where bdrv_co_preadv() will simplify
the code.
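
Sketched out (untested), the whole sequence collapses to a single call:

    ret = bdrv_co_preadv(bs->file, node->common.offset,
                         node->common.bytes, &readahead_acb.qiov, 0);

with whatever pcache_aio_readahead_cb() did before reentering the
coroutine simply following inline.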


It's similar with the other bdrv_aio_preadv() calls, which are in
pcache_co_preadv():

        if (bytes > s->max_aio_size) {
            bdrv_aio_preadv(bs->file, offset, qiov, bytes,
                            pcache_aio_read_cb, &acb);
            goto out;
        }

        update_req_stats(s->req_stats, offset, bytes);

        status = pcache_lookup_data(&acb);
        if (status == CACHE_MISS) {
            bdrv_aio_preadv(bs->file, offset, qiov, bytes,
                            pcache_aio_read_cb, &acb);
        } else if (status == PARTIAL_CACHE_HIT) {
            assert(acb.part.qiov.niov != 0);
            bdrv_aio_preadv(bs->file, acb.part.offset, &acb.part.qiov,
                            acb.part.bytes, pcache_aio_read_cb, &acb);
        }

        pcache_readahead_request(&acb);

        if (status == CACHE_HIT && --acb.ref == 0) {
            return 0;
        }

    out:
        qemu_coroutine_yield();

Here you have mainly the pcache_readahead_request() call between
bdrv_aio_preadv() and the yield. It only spawns a new coroutine, which
works in the background, so I think you can move it to before the reads;
then the reads can trivially become bdrv_co_preadv() and the callback
can again be inlined instead of ripping the function into two parts.
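
Roughly like this (an untested sketch, and ignoring the acb.ref
accounting that keeps acb alive until the readahead coroutine is done):

    if (bytes > s->max_aio_size) {
        return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
    }

    update_req_stats(s->req_stats, offset, bytes);

    status = pcache_lookup_data(&acb);

    /* spawn the background readahead before doing the read */
    pcache_readahead_request(&acb);

    if (status == CACHE_MISS) {
        return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
    } else if (status == PARTIAL_CACHE_HIT) {
        assert(acb.part.qiov.niov != 0);
        return bdrv_co_preadv(bs->file, acb.part.offset, acb.part.bytes,
                              &acb.part.qiov, flags);
    }

    return 0; /* CACHE_HIT */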


The bdrv_aio_pwritev() call in pcache_co_pwritev() is just the same
thing, and using the coroutine version results in obvious code
improvements.
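
Presumably it ends up as little more than (sketch):

    return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);

with whatever the callback did afterwards inlined after the call.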


And I think these are all the uses of bdrv_aio_*() in the pcache
driver, so converting it to use bdrv_co_*() instead isn't only
possible, but will also improve the legibility of your code. It's a
clear win in all three places.

Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v1 01/18] block/io: add bdrv_aio_{preadv, pwritev}
  2016-11-24 12:36       ` Kevin Wolf
@ 2016-11-24 15:10         ` Pavel Butsykin
  0 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-24 15:10 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, den, famz, stefanha, mreitz, eblake

On 24.11.2016 15:36, Kevin Wolf wrote:
> Am 24.11.2016 um 11:58 hat Pavel Butsykin geschrieben:
>> On 23.11.2016 17:28, Kevin Wolf wrote:
>>> Am 15.11.2016 um 07:36 hat Pavel Butsykin geschrieben:
>>>> It's just byte-based wrappers over bdrv_co_aio_prw_vector(), which provide
>>>>   a byte-based interface for AIO read/write.
>>>>
>>>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>>>
>>> I'm in the process of phasing out the last users of bdrv_aio_*() so that
>>> this set of interfaces can be removed. I'm doing this because it's an
>>> unnecessary redundancy, we have too many wrapper functions that expose
>>> the same functionality with different syntax. So let's not add new
>>> users.
>>>
>>> At first sight, you don't even seem to use bdrv_aio_preadv() for actual
>>> parallelism, but you often have a pattern like this:
>>>
>>>      void foo_cb(void *opaque)
>>>      {
>>>          ...
>>>          qemu_coroutine_enter(acb->co);
>>>      }
>>>
>>>      void caller()
>>>      {
>>>          ...
>>>          acb = bdrv_aio_preadv(...);
>>>          qemu_coroutine_yield();
>>>      }
>>>
>>> The code will actually become a lot simpler if you use bdrv_co_preadv()
>>> instead because you don't have to have a callback, but you get pure
>>> sequential code.
>>>
>>> The part that actually has some parallelism, pcache_readahead_request(),
>>> already creates its own coroutine, so it runs in the background without
>>> using callback-style interfaces.
>>
>> I used bdrv_co_preadv() because it conveniently solves the partial
>> cache hit. To solve the partial cache hit, we need to split a request
>> into smaller parts, issue asynchronous requests and wait for all of
>> them in one place.
>>
>> Do you propose to create a coroutine for each part of the request? It
>> seemed to me that bdrv_co_preadv() is a wrapper that allows us to
>> avoid duplicating that code.
>
> It's actually the other way round: bdrv_co_preadv() is the "native"
> block layer API, and bdrv_aio_*() are wrappers providing an alternative
> interface.
>

Sorry, I mixed up bdrv_co_preadv() and bdrv_aio_preadv() :)

Also I forgot that in this version, pcache can't split one request into
many small requests (as in the RFC version).

> I'm looking at pcache_co_readahead(), for example. It looks like this:
>
>      bdrv_aio_preadv(bs->file, node->common.offset, &readahead_acb.qiov,
>                      node->common.bytes, pcache_aio_readahead_cb,
>                      &readahead_acb);
>      qemu_coroutine_yield();
>
> And then we have pcache_aio_readahead_cb(), which ends in:
>
>      qemu_coroutine_enter(acb->co);
>
> So here the callback style doesn't buy you anything, it just rips the
> code apart into two functions. There is no parallelism here anyway:
> pcache_co_readahead() doesn't do anything until the callback reenters
> it. This is a very obvious example where bdrv_co_preadv() will simplify
> the code.
>

I agree.

> It's similar with the other bdrv_aio_preadv() calls, which are in
> pcache_co_preadv():
>
>          if (bytes > s->max_aio_size) {
>              bdrv_aio_preadv(bs->file, offset, qiov, bytes,
>                              pcache_aio_read_cb, &acb);
>              goto out;
>          }
>
>          update_req_stats(s->req_stats, offset, bytes);
>
>          status = pcache_lookup_data(&acb);
>          if (status == CACHE_MISS) {
>              bdrv_aio_preadv(bs->file, offset, qiov, bytes,
>                              pcache_aio_read_cb, &acb);
>          } else if (status == PARTIAL_CACHE_HIT) {
>              assert(acb.part.qiov.niov != 0);
>              bdrv_aio_preadv(bs->file, acb.part.offset, &acb.part.qiov,
>                              acb.part.bytes, pcache_aio_read_cb, &acb);
>          }
>
>          pcache_readahead_request(&acb);
>
>          if (status == CACHE_HIT && --acb.ref == 0) {
>              return 0;
>          }
>
>      out:
>          qemu_coroutine_yield();
>
> Here you have mainly the pcache_readahead_request() call between
> bdrv_aio_preadv() and the yield. It only spawns a new coroutine, which
> works in the background, so I think you can move it to before the reads;
> then the reads can trivially become bdrv_co_preadv() and the callback
> can again be inlined instead of ripping the function into two parts.
>
>
> The bdrv_aio_pwritev() call in pcache_co_pwritev() is just the same
> thing, and using the coroutine version results in obvious code
> improvements.
>
>
> And I think these are all the uses of bdrv_aio_*() in the pcache
> driver, so converting it to use bdrv_co_*() instead isn't only
> possible, but will also improve the legibility of your code. It's a
> clear win in all three places.
>

Yes, you're right, this scheme was acceptable when the caller was
bdrv_aio_*(), but now it's just confusing. Thanks for the explanation!

> Kevin
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v1 02/18] block/pcache: empty pcache driver filter
  2016-11-23 15:15   ` Kevin Wolf
@ 2016-11-24 15:48     ` Pavel Butsykin
  2016-11-24 16:39       ` Kevin Wolf
  0 siblings, 1 reply; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-24 15:48 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, den, famz, stefanha, mreitz, eblake

On 23.11.2016 18:15, Kevin Wolf wrote:
> Am 15.11.2016 um 07:36 hat Pavel Butsykin geschrieben:
>> The basic version of pcache driver for easy preparation of a patch set.
>>
>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> ---
>>   block/Makefile.objs |   1 +
>>   block/pcache.c      | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 145 insertions(+)
>>   create mode 100644 block/pcache.c
>>
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index 7d4031d..c60f680 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -4,6 +4,7 @@ block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
>>   block-obj-y += qed-check.o
>>   block-obj-y += vhdx.o vhdx-endian.o vhdx-log.o
>>   block-obj-y += quorum.o
>> +block-obj-y += pcache.o
>>   block-obj-y += parallels.o blkdebug.o blkverify.o blkreplay.o
>>   block-obj-y += block-backend.o snapshot.o qapi.o
>>   block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
>> diff --git a/block/pcache.c b/block/pcache.c
>> new file mode 100644
>> index 0000000..59461df
>> --- /dev/null
>> +++ b/block/pcache.c
>> @@ -0,0 +1,144 @@
>> +/*
>> + * Prefetch cache driver filter
>> + *
>> + * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
>
> "All rights reserved" doesn't really agree with licensing under the GPL.
> It would be preferable to remove this note to avoid any ambiguity.

Yes, you're right. It seems that 'All rights reserved' is directly
inconsistent with the GPL.

>> + *
>> + * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * later.  See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "block/block_int.h"
>> +#include "qapi/error.h"
>> +#include "qapi/qmp/qstring.h"
>> +
>> +
>> +static QemuOptsList runtime_opts = {
>> +    .name = "pcache",
>> +    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
>> +    .desc = {
>> +        {
>> +            .name = "x-image",
>> +            .type = QEMU_OPT_STRING,
>> +            .help = "[internal use only, will be removed]",
>> +        },
>
> blkdebug/blkverify have this because they have to support legacy syntax
> like -drive file=blkdebug:blkdebug.conf:test.img, i.e. they have to deal
> with filenames.
>
> Here we don't have to support a legacy syntax, so I would completely
> avoid this from the beginning. You already support the "image" option,
> which should be good enough.

Then the command line would look like this:

-drive 
file=/img/harddisk.hdd,if=none,id=drive-scsi0-0-0-0,cache=none,aio=native
-drive driver=pcache,image=scsi0-0-0-0,if=virtio

Ok.

>> +        { /* end of list */ }
>> +    },
>> +};
>> +
>> +typedef struct PCacheAIOCB {
>> +    Coroutine *co;
>> +    int ret;
>> +} PCacheAIOCB;
>> +
>> +static void pcache_aio_cb(void *opaque, int ret)
>> +{
>> +    PCacheAIOCB *acb = opaque;
>> +
>> +    acb->ret = ret;
>> +    qemu_coroutine_enter(acb->co);
>> +}
>> +
>> +static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
>> +                                         uint64_t bytes, QEMUIOVector *qiov,
>> +                                         int flags)
>> +{
>> +    PCacheAIOCB acb = {
>> +        .co = qemu_coroutine_self(),
>> +    };
>> +
>> +    bdrv_aio_preadv(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
>> +
>> +    qemu_coroutine_yield();
>> +
>> +    return acb.ret;
>> +}
>
> As I commented on patch 1, all of this can be replaced by a single line:
>
>      return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
>
>> +static coroutine_fn int pcache_co_pwritev(BlockDriverState *bs, uint64_t offset,
>> +                                          uint64_t bytes, QEMUIOVector *qiov,
>> +                                          int flags)
>> +{
>> +    PCacheAIOCB acb = {
>> +        .co = qemu_coroutine_self(),
>> +    };
>> +
>> +    bdrv_aio_pwritev(bs->file, offset, qiov, bytes, pcache_aio_cb, &acb);
>> +
>> +    qemu_coroutine_yield();
>> +
>> +    return acb.ret;
>> +}
>
> Same here.
>
>> +static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
>> +                            Error **errp)
>> +{
>> +    QemuOpts *opts;
>> +    Error *local_err = NULL;
>> +    int ret = 0;
>> +
>> +    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
>> +    qemu_opts_absorb_qdict(opts, options, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        ret = -EINVAL;
>> +        goto fail;
>> +    }
>> +
>> +    assert(bs->file == NULL);
>> +    bs->file = bdrv_open_child(qemu_opt_get(opts, "x-image"), options,
>> +                               "image", bs, &child_format, false, &local_err);
>
> When removing x-image, the first parameter becomes NULL here.
>
>> +    if (local_err) {
>> +        ret = -EINVAL;
>> +        error_propagate(errp, local_err);
>> +    }
>> +fail:
>> +    qemu_opts_del(opts);
>> +    return ret;
>> +}
>> +
>> +static void pcache_close(BlockDriverState *bs)
>> +{
>> +}
>> +
>> +static void pcache_parse_filename(const char *filename, QDict *options,
>> +                                  Error **errp)
>> +{
>> +    qdict_put(options, "x-image", qstring_from_str(filename));
>> +}
>
> This one goes away.
>
>> +static int64_t pcache_getlength(BlockDriverState *bs)
>> +{
>> +    return bdrv_getlength(bs->file->bs);
>> +}
>> +
>> +static bool pcache_recurse_is_first_non_filter(BlockDriverState *bs,
>> +                                               BlockDriverState *candidate)
>> +{
>> +    return bdrv_recurse_is_first_non_filter(bs->file->bs, candidate);
>> +}
>> +
>> +static BlockDriver bdrv_pcache = {
>> +    .format_name                        = "pcache",
>> +    .protocol_name                      = "pcache",
>
> Remove this one, it would only be used for something like
> "pcache:<filename>" syntax.
>
>> +    .instance_size                      = 0,
>> +
>> +    .bdrv_parse_filename                = pcache_parse_filename,
>> +    .bdrv_file_open                     = pcache_file_open,
>> +    .bdrv_close                         = pcache_close,
>> +    .bdrv_getlength                     = pcache_getlength,
>> +
>> +    .bdrv_co_preadv                     = pcache_co_preadv,
>> +    .bdrv_co_pwritev                    = pcache_co_pwritev,
>> +
>> +    .is_filter                          = true,
>> +    .bdrv_recurse_is_first_non_filter   = pcache_recurse_is_first_non_filter,
>> +};
>> +
>> +static void bdrv_cache_init(void)
>> +{
>> +    bdrv_register(&bdrv_pcache);
>> +}
>> +
>> +block_init(bdrv_cache_init);
>
> You should also add pcache to the QAPI schema, so that blockdev-add can
> create instances of it.

Ok.

> Kevin
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v1 02/18] block/pcache: empty pcache driver filter
  2016-11-24 15:48     ` Pavel Butsykin
@ 2016-11-24 16:39       ` Kevin Wolf
  0 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2016-11-24 16:39 UTC (permalink / raw)
  To: Pavel Butsykin
  Cc: qemu-devel, qemu-block, den, famz, stefanha, mreitz, eblake

Am 24.11.2016 um 16:48 hat Pavel Butsykin geschrieben:
> On 23.11.2016 18:15, Kevin Wolf wrote:
> >Am 15.11.2016 um 07:36 hat Pavel Butsykin geschrieben:
> >>+static QemuOptsList runtime_opts = {
> >>+    .name = "pcache",
> >>+    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
> >>+    .desc = {
> >>+        {
> >>+            .name = "x-image",
> >>+            .type = QEMU_OPT_STRING,
> >>+            .help = "[internal use only, will be removed]",
> >>+        },
> >
> >blkdebug/blkverify have this because they have to support legacy syntax
> >like -drive file=blkdebug:blkdebug.conf:test.img, i.e. they have to deal
> >with filenames.
> >
> >Here we don't have to support a legacy syntax, so I would completely
> >avoid this from the beginning. You already support the "image" option,
> >which should be good enough.
> 
> Then the command line would look like this:
> 
> -drive file=/img/harddisk.hdd,if=none,id=drive-scsi0-0-0-0,cache=none,aio=native
> -drive driver=pcache,image=scsi0-0-0-0,if=virtio

Yes, either that or with an inline block node definition; the block
layer supports both:

-drive driver=pcache,image.drive=file,image.filename=/img/harddisk.hdd,if=virtio,cache=none,image.aio=native

Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v1 04/18] util/rbcache: range-based cache core
  2016-11-23 21:25   ` Kevin Wolf
@ 2016-11-24 19:23     ` Pavel Butsykin
  0 siblings, 0 replies; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-24 19:23 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, den, famz, stefanha, mreitz, eblake

On 24.11.2016 00:25, Kevin Wolf wrote:
> Am 15.11.2016 um 07:37 hat Pavel Butsykin geschrieben:
>> RBCache provides functionality to cache the data from block devices
>> (basically). The range here is used as the main key for searching and storing
>> data. The cache is based on red-black trees, so basic operations search,
>> insert, delete are performed for O(log n).
>>
>> It is important to note that QEMU usually does not require a data cache, but
>> in reality, there are already some cases where a cache of small amounts can
>> increase performance, so as the data structure was selected red-black trees,
>> this is a fairly simple data structure and show high efficiency on a small
>> number of elements. Therefore, when the minimum range is 512 bytes, the
>> recommended size of the cache memory no more than 8-16mb.  Also note
>> that this cache implementation allows to store ranges of different lengths
>> without alignment.
>>
>> Generic cache core can easily be used to implement different caching policies at
>> the block level, such as read-ahed. Also it can be used in some special cases,
>> for example for caching data in qcow2 when sequential allocating writes to image
>> with backing file.
>>
>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> ---
>>   MAINTAINERS            |   6 ++
>>   include/qemu/rbcache.h | 105 +++++++++++++++++++++
>>   util/Makefile.objs     |   1 +
>>   util/rbcache.c         | 246 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   4 files changed, 358 insertions(+)
>>   create mode 100644 include/qemu/rbcache.h
>>   create mode 100644 util/rbcache.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index ddf797b..cb74802 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1365,6 +1365,12 @@ F: include/qemu/rbtree.h
>>   F: include/qemu/rbtree_augmented.h
>>   F: util/rbtree.c
>>
>> +Range-Based Cache
>> +M: Denis V. Lunev <den@openvz.org>
>> +S: Supported
>> +F: include/qemu/rbcache.h
>> +F: util/rbcache.c
>> +
>>   UUID
>>   M: Fam Zheng <famz@redhat.com>
>>   S: Supported
>> diff --git a/include/qemu/rbcache.h b/include/qemu/rbcache.h
>> new file mode 100644
>> index 0000000..c8f0a9f
>> --- /dev/null
>> +++ b/include/qemu/rbcache.h
>> @@ -0,0 +1,105 @@
>> +/*
>> + * QEMU Range-Based Cache core
>> + *
>> + * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
>> + *
>> + * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * later.  See the COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef RBCACHE_H
>> +#define RBCACHE_H
>> +
>> +#include "qemu/rbtree.h"
>> +#include "qemu/queue.h"
>> +
>> +typedef struct RBCacheNode {
>> +    struct RbNode rb_node;
>> +    uint64_t offset;
>> +    uint64_t bytes;
>> +    QTAILQ_ENTRY(RBCacheNode) entry;
>> +} RBCacheNode;
>> +
>> +typedef struct RBCache RBCache;
>> +
>> +typedef RBCacheNode *RBNodeAlloc(uint64_t offset, uint64_t bytes, void *opaque);
>> +typedef void RBNodeFree(RBCacheNode *node, void *opaque);
>
> Maybe worth comments describing what these functions do apart from
> g_new()/g_free()? I assume that offset and bytes must be initialised
> from the parameters. Should rb_node and entry be zeroed?

This callback is called inside rbcache and makes it possible to extend
the common node descriptor structure without modifying the rbcache
code. Offset and size are passed as information about the node, which
may not even be needed, as for example the offset in
pcache_node_alloc(). rb_node and entry are similar to private fields:
they are initialized when the node is added to the rbtree and are not
needed outside at all. But maybe as an extra precaution we can use
g_new0().
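
For illustration, an extended node along these lines could look like
this (a simplified sketch, not the exact pcache code):

    typedef struct PCacheNode {
        RBCacheNode common;
        uint8_t *data;
    } PCacheNode;

    static RBCacheNode *pcache_node_alloc(uint64_t offset, uint64_t bytes,
                                          void *opaque)
    {
        PCacheNode *node = g_new(PCacheNode, 1);

        /* the offset parameter isn't even needed here; rbcache fills
         * in common.offset/common.bytes and links rb_node/entry
         * itself after this returns */
        node->data = g_malloc(bytes);

        return &node->common;
    }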

>> +
>> +enum eviction_type {
>> +    RBCACHE_FIFO,
>> +    RBCACHE_LRU,
>> +};
>> +
>> +/**
>> + * rbcache_search:
>> + * @rbcache: the cache object.
>> + * @offset: the start of the range.
>> + * @bytes: the size of the range.
>> + *
>> + * Returns the node corresponding to the range(offset, bytes),
>> + * or NULL if the node was not found.
>> + */
>> +void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes);
>
> What if the range covers multiple nodes? Is it defined which of the
> nodes we return or do you just get any?

Yes, it's defined: it returns the overlapping node with the lowest
offset. For example, with nodes [0, 1k) and [1k, 3k) in the cache, a
search for [512, 2560) returns the [0, 1k) node.

> Why does this function (and the following ones) return void* rather than
> RBCacheNode* if they are supposed to return a node?

This is done to get rid of the extra container_of() when using an
extended node descriptor such as PCacheNode. Is that not acceptable?
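
I.e. assuming RBCacheNode is embedded as the first field (as in
PCacheNode), with void* the caller can simply write:

    PCacheNode *node = rbcache_search(cache, offset, bytes);

whereas with an RBCacheNode* return type every caller would need:

    RBCacheNode *rbnode = rbcache_search(cache, offset, bytes);
    PCacheNode *node = container_of(rbnode, PCacheNode, common);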

>> +/**
>> + * rbcache_insert:
>> + * @rbcache: the cache object.
>> + * @node: a new node for the cache.
>> + *
>> + * Returns the new node, or old node if the node already exists.
>> + */
>> +void *rbcache_insert(RBCache *rbcache, RBCacheNode *node);
>
> What does "if the node already exists" mean? If @node (the very same
> object) is already stored in the cache object, or if a node describing
> the same range already exists?

The latter: if a node describing the same range already exists.

>> +/**
>> + * rbcache_search_and_insert:
>> + * @rbcache: the cache object.
>> + * @offset: the start of the range.
>> + * @bytes: the size of the range.
>> + *
>> + * rbcache_search_and_insert() is like rbcache_insert(), except that a new node
>> + * is allocated inside the function. Returns the new node, or old node if the
>> + * node already exists.
>> + */
>> +void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
>> +                                uint64_t byte);
>
> What happens if a node exists, but only for part of the range?


In this case, it returns the existing node (the overlapping one with
the lowest offset).

>> +/**
>> + * rbcache_remove:
>> + * @rbcache: the cache object.
>> + * @node: the node to remove.
>> + *
>> + * Removes the cached range owned by the node.
>> + */
>> +void rbcache_remove(RBCache *rbcache, RBCacheNode *node);
>> +
>> +RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
>> +                                uint64_t bytes);
>> +void rbcache_node_free(RBCache *rbcache, RBCacheNode *node);
>
> No comment for these two?

will add.

>> +
>> +/**
>> + * rbcache_create:
>> + * @alloc: callback to allocation node, allows to upgrade allocate and expand
>> + *         the capabilities of the node.
>> + * @free: callback to release node, must be used together with alloc callback.
>> + * @limit_size: maximum cache size in bytes.
>> + * @eviction_type: method of memory limitation
>> + * @opaque: the opaque pointer to pass to the callback.
>> + *
>> + * Returns the cache object.
>> + */
>> +RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
>> +                        uint64_t limit_size, int eviction_type, void *opaque);
>> +
>> +/**
>> + * rbcache_destroy:
>> + * @rbcache: the cache object.
>> + *
>> + * Cleanup the cache object created with rbcache_create().
>> + */
>> +void rbcache_destroy(RBCache *rbcache);
>> +
>> +#endif /* RBCACHE_H */
>> diff --git a/util/Makefile.objs b/util/Makefile.objs
>> index 727a567..9fb0de6 100644
>> --- a/util/Makefile.objs
>> +++ b/util/Makefile.objs
>> @@ -38,3 +38,4 @@ util-obj-y += qdist.o
>>   util-obj-y += qht.o
>>   util-obj-y += range.o
>>   util-obj-y += rbtree.o
>> +util-obj-y += rbcache.o
>> diff --git a/util/rbcache.c b/util/rbcache.c
>> new file mode 100644
>> index 0000000..05cfa5a
>> --- /dev/null
>> +++ b/util/rbcache.c
>> @@ -0,0 +1,246 @@
>> +/*
>> + * QEMU Range-Based Cache core
>> + *
>> + * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
>> + *
>> + * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * later.  See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/rbcache.h"
>> +
>> +/* RBCache provides functionality to cache the data from block devices
>> + * (basically). The range here is used as the main key for searching and storing
>> + * data. The cache is based on red-black trees, so basic operations search,
>> + * insert, delete are performed for O(log n).
>> + *
>> + * It is important to note that QEMU usually does not require a data cache, but
>> + * in reality, there are already some cases where a cache of small amounts can
>> + * increase performance, so as the data structure was selected red-black trees,
>> + * this is a quite simple data structure and show high efficiency on a small
>> + * number of elements. Therefore, when the minimum range is 512 bytes, the
>> + * recommended size of the cache memory no more than 8-16mb. Also note that this
>> + * cache implementation allows to store ranges of different lengths without
>> + * alignment.
>> + */
>> +
>> +struct RBCache {
>> +    struct RbRoot root;
>> +    RBNodeAlloc *alloc;
>> +    RBNodeFree  *free;
>> +    uint64_t limit_size;
>> +    uint64_t cur_size;
>> +    int eviction_type;
>
> s/int/enum eviction_type/
>
>> +    void *opaque;
>> +
>> +    QTAILQ_HEAD(RBCacheNodeHead, RBCacheNode) queue;
>> +};
>> +
>> +static int node_key_cmp(const RBCacheNode *node1, const RBCacheNode *node2)
>> +{
>> +    assert(node1 != NULL);
>> +    assert(node2 != NULL);
>> +
>> +    if (node1->offset >= node2->offset + node2->bytes) {
>> +        return 1;
>> +    }
>> +    if (node1->offset + node1->bytes <= node2->offset) {
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static RBCacheNode *node_previous(RBCacheNode* node, uint64_t target_offset)
>
> It would be nice to describe this function in a comment.

Ok.

>  From what I understand, it goes backwards in the tree until the last
> node whose range end is larger than target_offset is reached.

Right.

>> +{
>> +    while (node) {
>> +        struct RbNode *prev_rb_node = rb_prev(&node->rb_node);
>> +        RBCacheNode *prev_node;
>> +        if (prev_rb_node == NULL) {
>> +            break;
>> +        }
>> +        prev_node = container_of(prev_rb_node, RBCacheNode, rb_node);
>> +        if (prev_node->offset + prev_node->bytes <= target_offset) {
>> +            break;
>> +        }
>> +        node = prev_node;
>> +    }
>> +
>> +    assert(node != NULL);
>> +
>> +    return node;
>> +}
>> +
>> +RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
>> +                                uint64_t bytes)
>> +{
>> +    RBCacheNode *node;
>> +
>> +    if (rbcache->alloc) {
>> +        node = rbcache->alloc(offset, bytes, rbcache->opaque);
>> +    } else {
>> +        node = g_malloc(sizeof(*node));
>
> g_new() is preferred.
>
>> +    }
>> +
>> +    node->offset = offset;
>> +    node->bytes  = bytes;
>
> Okay, so the callback doesn't have to set offset/bytes because they are
> already set here, and the callback can be NULL. Both things should be
> mentioned in the documentation.
>
>> +    return node;
>> +}
>> +
>> +void rbcache_node_free(RBCache *rbcache, RBCacheNode *node)
>> +{
>> +    if (rbcache->free) {
>> +        rbcache->free(node, rbcache->opaque);
>> +    } else {
>> +        g_free(node);
>> +    }
>> +}
>> +
>> +static void rbcache_try_shrink(RBCache *rbcache)
>> +{
>> +    while (rbcache->cur_size > rbcache->limit_size) {
>> +        RBCacheNode *node;
>> +        assert(!QTAILQ_EMPTY(&rbcache->queue));
>> +
>> +        node = QTAILQ_LAST(&rbcache->queue, RBCacheNodeHead);
>> +
>> +        rbcache_remove(rbcache, node);
>> +    }
>> +}
>> +
>> +static inline void node_move_in_queue(RBCache *rbcache, RBCacheNode *node)
>> +{
>> +    if (rbcache->eviction_type == RBCACHE_LRU) {
>> +        QTAILQ_REMOVE(&rbcache->queue, node, entry);
>> +        QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
>> +    }
>> +}
>> +
>> +static RBCacheNode *node_insert(RBCache *rbcache, RBCacheNode *node, bool alloc)
>
> Here as well there seem to be a few different cases and it would be good
> to have a comment that specifies the behaviour for each of them. It
> would also give me something to review against.

Ok, I'll try to add more comments. It seems I got so used to the code
that the issue became less noticeable to me.

>> +{
>> +    struct RbNode **new, *parent = NULL;
>> +
>> +    assert(rbcache != NULL);
>> +    assert(node->bytes != 0);
>> +
>> +    /* Figure out where to put new node */
>> +    new = &(rbcache->root.rb_node);
>> +    while (*new) {
>> +        RBCacheNode *this = container_of(*new, RBCacheNode, rb_node);
>> +        int result = node_key_cmp(node, this);
>> +        if (result == 0) {
>> +            this = node_previous(this, node->offset);
>
> So in case of partial overlaps, the existing overlapping node with the
> lowest offset is returned, but the range is not adjusted. Correct?

That's correct. RBCache cannot store overlapping nodes, but if we want
to fill in the missing ranges, we can do that based on the returned
nodes (see pickup_parts_of_cache()).
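
The idea is roughly this (illustration only, not the actual
pickup_parts_of_cache() code):

    uint64_t end = offset + bytes;

    while (offset < end) {
        /* returns the lowest-offset overlapping node, or inserts
         * a new one if nothing overlaps */
        RBCacheNode *node = rbcache_search_and_insert(cache, offset,
                                                      end - offset);
        if (node->offset > offset) {
            /* the gap in front of the existing node is missing:
             * cache it as a separate node */
            rbcache_search_and_insert(cache, offset,
                                      node->offset - offset);
        }
        offset = MIN(end, node->offset + node->bytes);
    }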

>> +            node_move_in_queue(rbcache, this);
>> +            return this;
>> +        }
>> +        parent = *new;
>> +        new = result < 0 ? &((*new)->rb_left) : &((*new)->rb_right);
>> +    }
>> +
>> +    if (alloc) {
>> +        node = rbcache_node_alloc(rbcache, node->offset, node->bytes);
>> +    }
>> +    /* Add new node and rebalance tree. */
>> +    rb_link_node(&node->rb_node, parent, new);
>> +    rb_insert_color(&node->rb_node, &rbcache->root);
>> +
>> +    rbcache->cur_size += node->bytes;
>> +
>> +    rbcache_try_shrink(rbcache);
>> +
>> +    QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
>> +
>> +    return node;
>> +}
>> +
>> +void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes)
>> +{
>> +    struct RbNode *rb_node;
>> +    RBCacheNode node = {
>> +        .offset = offset,
>> +        .bytes  = bytes,
>> +    };
>> +
>> +    assert(rbcache != NULL);
>> +
>> +    rb_node = rbcache->root.rb_node;
>> +    while (rb_node) {
>> +        RBCacheNode *this = container_of(rb_node, RBCacheNode, rb_node);
>> +        int32_t result = node_key_cmp(&node, this);
>> +        if (result == 0) {
>> +            this = node_previous(this, offset);
>> +            node_move_in_queue(rbcache, this);
>> +            return this;
>> +        }
>> +        rb_node = result < 0 ? rb_node->rb_left : rb_node->rb_right;
>> +    }
>> +    return NULL;
>> +}
>> +
>> +void *rbcache_insert(RBCache *rbcache, RBCacheNode *node)
>> +{
>> +    return node_insert(rbcache, node, false);
>> +}
>> +
>> +void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
>> +                                uint64_t bytes)
>> +{
>> +    RBCacheNode node = {
>> +        .offset = offset,
>> +        .bytes  = bytes,
>> +    };
>> +
>> +    return node_insert(rbcache, &node, true);
>> +}
>> +
>> +void rbcache_remove(RBCache *rbcache, RBCacheNode *node)
>> +{
>> +    assert(rbcache->cur_size >= node->bytes);
>> +
>> +    rbcache->cur_size -= node->bytes;
>> +    rb_erase(&node->rb_node, &rbcache->root);
>> +
>> +    QTAILQ_REMOVE(&rbcache->queue, node, entry);
>> +
>> +    rbcache_node_free(rbcache, node);
>> +}
>> +
>> +RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
>> +                        uint64_t limit_size, int eviction_type, void *opaque)
>> +{
>> +    RBCache *rbcache = g_malloc(sizeof(*rbcache));
>
> Again, g_new() is preferred.
>
>> +
>> +    /* We can't use only one callback, or both or neither */
>> +    assert(!(!alloc ^ !free));
>> +
>> +    rbcache->root   = RB_ROOT;
>> +    rbcache->alloc  = alloc;
>> +    rbcache->free   = free;
>> +    rbcache->opaque = opaque;
>> +    rbcache->limit_size = limit_size;
>> +    rbcache->cur_size   = 0;
>> +    rbcache->eviction_type = eviction_type;
>> +
>> +    QTAILQ_INIT(&rbcache->queue);
>
> And actually, it would be good to make sure that all fields are zeroed
> even if the struct is extended. You can do that with g_new0(), or -
> that's what I would do - write the initialisation like this:
>
>      *rbcache = (RBCache) {
>          .root       = RB_ROOT,
>          .alloc      = alloc,
>          ...
>      };

Yes, I like the second variant.

> If you align all values on a column (which is appreciated), please do so
> consistently and align everything to the same column as the longest
> field name (eviction_type).

Thanks, I will consider it.

>> +
>> +    return rbcache;
>> +}
>> +
>> +void rbcache_destroy(RBCache *rbcache)
>> +{
>> +    RBCacheNode *node, *next;
>> +
>> +    assert(rbcache != NULL);
>> +
>> +    QTAILQ_FOREACH_SAFE(node, &rbcache->queue, entry, next) {
>> +        QTAILQ_REMOVE(&rbcache->queue, node, entry);
>> +        rbcache_node_free(rbcache, node);
>> +    }
>> +
>> +    g_free(rbcache);
>> +}

What do you think about this overall? Is it suitable for your data
cache in qcow2?

> Kevin
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH v1 05/18] tests/test-rbcache: add test cases
  2016-11-24 12:20   ` Kevin Wolf
@ 2016-11-25  9:58     ` Pavel Butsykin
  2016-11-25 10:11       ` Kevin Wolf
  0 siblings, 1 reply; 32+ messages in thread
From: Pavel Butsykin @ 2016-11-25  9:58 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, den, famz, stefanha, mreitz, eblake

On 24.11.2016 15:20, Kevin Wolf wrote:
> Am 15.11.2016 um 07:37 hat Pavel Butsykin geschrieben:
>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> ---
>>   tests/Makefile.include |   3 +
>>   tests/test-rbcache.c   | 308 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 311 insertions(+)
>>   create mode 100644 tests/test-rbcache.c
>>
>> diff --git a/tests/Makefile.include b/tests/Makefile.include
>> index 8162f6f..36bb472 100644
>> --- a/tests/Makefile.include
>> +++ b/tests/Makefile.include
>> @@ -54,6 +54,8 @@ check-unit-y += tests/test-hbitmap$(EXESUF)
>>   gcov-files-test-hbitmap-y = blockjob.c
>>   check-unit-y += tests/test-blockjob$(EXESUF)
>>   check-unit-y += tests/test-blockjob-txn$(EXESUF)
>> +gcov-files-test-rbcache-y = util/rbcache.c
>> +check-unit-y += tests/test-rbcache$(EXESUF)
>>   check-unit-y += tests/test-x86-cpuid$(EXESUF)
>>   # all code tested by test-x86-cpuid is inside topology.h
>>   gcov-files-test-x86-cpuid-y =
>> @@ -481,6 +483,7 @@ tests/test-blockjob-txn$(EXESUF): tests/test-blockjob-txn.o $(test-block-obj-y)
>>   tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
>>   tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
>>   tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y)
>> +tests/test-rbcache$(EXESUF): tests/test-rbcache.o $(test-util-obj-y)
>>   tests/test-x86-cpuid$(EXESUF): tests/test-x86-cpuid.o
>>   tests/test-xbzrle$(EXESUF): tests/test-xbzrle.o migration/xbzrle.o page_cache.o $(test-util-obj-y)
>>   tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o
>> diff --git a/tests/test-rbcache.c b/tests/test-rbcache.c
>> new file mode 100644
>> index 0000000..1c72bfa
>> --- /dev/null
>> +++ b/tests/test-rbcache.c
>> @@ -0,0 +1,308 @@
>> +/*
>> + * QEMU Range-Based Cache core
>> + *
>> + * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
>> + *
>> + * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * later.  See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/rbcache.h"
>> +
>> +typedef struct TestRBCacheData {
>> +    RBCache *cache;
>> +} TestRBCacheData;
>> +
>> +typedef struct TestRBCacheConfig {
>> +    uint64_t limit_size;
>> +    int eviction_type;
>> +    RBNodeAlloc *alloc;
>> +    RBNodeFree  *free;
>> +    void *opaque;
>> +} TestRBCacheConfig;
>> +
>> +#define KB(_n) ((_n) << 10)
>> +#define MB(_n) ((_n) << 20)
>> +
>> +#define OFFSET1 0
>> +#define SIZE1   KB(1)
>> +
>> +#define OFFSET2 KB(1)
>> +#define SIZE2   KB(2)
>> +
>> +#define OFFSET3 KB(15)
>> +#define SIZE3   KB(1)
>> +
>> +#define OFFSET4 KB(7)
>> +#define SIZE4   KB(7)
>> +
>> +#define OFFSET5 KB(2)
>> +#define SIZE5   KB(8)
>
> Visualised, we test these requests:
>
> 1: *
> 2:  **
> 3:                *
> 4:        *******
> 5:   ********
>
> You test inserting the only element, inserting after the last element,
> inserting in the middle and inserting something that overlaps two other
> requests at its start and end.
>
> That's a good start, but it might be worth testing more scenarios:
>
> - Inserting a new first element to a non-empty cache

What do you mean? To insert an element with zero offset when the cache
already contains other nodes?

> - Overlapping only at the start
> - Overlapping only at the end
> - Overlapping in the middle (i.e. including existing ranges as a
>    subset)
>      * With only one node
>      * With multiple nodes (like adding offset=2, size=16kb here)
>

Ok.

>> +
>> +static void test_rbcache_init(TestRBCacheData *data, const void *ctx)
>> +{
>> +    g_assert_nonnull(data->cache);
>> +}
>> +
>> +static void test_rbcache_insert(TestRBCacheData *data, const void *ctx)
>> +{
>> +    RBCacheNode *node1 = rbcache_node_alloc(data->cache, OFFSET1, SIZE1);
>> +    RBCacheNode *node2 = rbcache_node_alloc(data->cache, OFFSET2, SIZE2);
>> +    RBCacheNode *node3 = rbcache_node_alloc(data->cache, OFFSET3, SIZE3);
>> +    RBCacheNode *node4 = rbcache_node_alloc(data->cache, OFFSET4, SIZE4);
>> +    RBCacheNode *node5 = rbcache_node_alloc(data->cache, OFFSET5, SIZE5);
>> +    RBCacheNode *node;
>> +
>> +    node = rbcache_insert(data->cache, node1);
>> +    g_assert_true(node == node1);
>> +
>> +    node = rbcache_insert(data->cache, node2);
>> +    g_assert_true(node == node2);
>> +
>> +    node = rbcache_insert(data->cache, node3);
>> +    g_assert_true(node == node3);
>> +
>> +    node = rbcache_insert(data->cache, node4);
>> +    g_assert_true(node == node4);
>> +
>> +    node = rbcache_insert(data->cache, node5);
>> +    g_assert_true(node != node5);
>
> You can test for the right result instead:
>
>      g_assert_true(node == node2);
>
>
>> +    rbcache_node_free(data->cache, node5);
>> +}
>> +
>> +static void test_rbcache_search(TestRBCacheData *data, const void *ctx)
>> +{
>> +    RBCacheNode *node;
>> +
>> +    test_rbcache_insert(data, ctx);
>> +
>> +    node = rbcache_search(data->cache, OFFSET1, SIZE1);
>> +    g_assert_nonnull(node);
>> +    g_assert_cmpuint(node->offset, ==, OFFSET1);
>> +    g_assert_cmpuint(node->bytes, ==, SIZE1);
>> +
>> +    node = rbcache_search(data->cache, OFFSET2 + KB(1), SIZE1);
>
> Intentional that you use SIZE1 here even though we're looking at node 2?

No, here we need to use SIZE2 or SIZE2 - KB(1).

>> +    g_assert_nonnull(node);
>> +    g_assert_cmpuint(node->offset, ==, OFFSET2);
>> +    g_assert_cmpuint(node->bytes, ==, SIZE2);
>> +
>> +    node = rbcache_search(data->cache, OFFSET5, SIZE5);
>> +    g_assert_nonnull(node);
>> +    g_assert_cmpuint(node->offset, ==, OFFSET2);
>> +    g_assert_cmpuint(node->bytes, ==, SIZE2);
>> +
>> +    node = rbcache_search(data->cache, OFFSET5 + KB(2), SIZE5);
>> +    g_assert_nonnull(node);
>> +    g_assert_cmpuint(node->offset, ==, OFFSET4);
>> +    g_assert_cmpuint(node->bytes, ==, SIZE4);
>> +
>> +    node = rbcache_search(data->cache, OFFSET3 + SIZE3, SIZE3);
>> +    g_assert_null(node);
>> +}
>> +
>> +static void test_rbcache_search_and_insert(TestRBCacheData *data,
>> +                                           const void *ctx)
>> +{
>> +    RBCacheNode *node;
>> +
>> +    node = rbcache_search_and_insert(data->cache, OFFSET1, SIZE1);
>> +    g_assert_nonnull(node);
>> +    g_assert_cmpuint(node->offset, ==, OFFSET1);
>> +    g_assert_cmpuint(node->bytes, ==, SIZE1);
>> +
>> +    node = rbcache_search_and_insert(data->cache, OFFSET2, SIZE2);
>> +    g_assert_nonnull(node);
>> +    g_assert_cmpuint(node->offset, ==, OFFSET2);
>> +    g_assert_nonnull(node);
>
> I think you wanted to check node->bytes here instead of duplicating the
> nonnull check.
>
>> +    node = rbcache_search_and_insert(data->cache, OFFSET3, SIZE3);
>> +    g_assert_nonnull(node);
>> +    g_assert_cmpuint(node->offset, ==, OFFSET3);
>> +    g_assert_cmpuint(node->bytes, ==, SIZE3);
>> +
>> +    node = rbcache_search_and_insert(data->cache, OFFSET4, SIZE4);
>> +    g_assert_nonnull(node);
>> +    g_assert_cmpuint(node->offset, ==, OFFSET4);
>> +    g_assert_cmpuint(node->bytes, ==, SIZE4);
>> +
>> +    node = rbcache_search_and_insert(data->cache, OFFSET5, SIZE5);
>> +    g_assert_nonnull(node);
>> +    g_assert_cmpuint(node->offset, ==, OFFSET2);
>> +    g_assert_cmpuint(node->bytes, ==, SIZE2);
>> +}
>> +
>> +static void test_rbcache_remove(TestRBCacheData *data, const void *ctx)
>> +{
>> +    RBCacheNode *node;
>> +
>> +    test_rbcache_search_and_insert(data, ctx);
>> +
>> +    node = rbcache_search(data->cache, OFFSET1, SIZE1);
>> +    g_assert_nonnull(node);
>> +    rbcache_remove(data->cache, node);
>> +    node = rbcache_search(data->cache, OFFSET1, SIZE1);
>> +    g_assert_null(node);
>> +
>> +    node = rbcache_search(data->cache, OFFSET3, SIZE3);
>> +    g_assert_nonnull(node);
>> +    rbcache_remove(data->cache, node);
>> +    node = rbcache_search(data->cache, OFFSET3, SIZE3);
>> +    g_assert_null(node);
>> +
>> +    node = rbcache_search(data->cache, OFFSET4, SIZE4);
>> +    g_assert_nonnull(node);
>> +    rbcache_remove(data->cache, node);
>> +    node = rbcache_search(data->cache, OFFSET4, SIZE4);
>> +    g_assert_null(node);
>> +
>> +    node = rbcache_search(data->cache, OFFSET2, SIZE2);
>> +    g_assert_nonnull(node);
>> +    rbcache_remove(data->cache, node);
>> +    node = rbcache_search(data->cache, OFFSET2, SIZE2);
>> +    g_assert_null(node);
>> +}
>> +
>> +static void test_rbcache_shrink(TestRBCacheData *data, const void *ctx)
>> +{
>> +    RBCacheNode *node;
>> +
>> +    node = rbcache_search_and_insert(data->cache, 0, MB(2));
>> +    g_assert_nonnull(node);
>> +
>> +    node = rbcache_search_and_insert(data->cache, MB(2), MB(3));
>> +    g_assert_nonnull(node);
>> +
>> +    node = rbcache_search(data->cache, 0, MB(2));
>> +    g_assert_null(node);
>> +
>> +    node = rbcache_search(data->cache, MB(2), MB(3));
>> +    g_assert_nonnull(node);
>> +
>> +    node = rbcache_search_and_insert(data->cache, 0, MB(2));
>> +    g_assert_nonnull(node);
>> +
>> +    node = rbcache_search(data->cache, 0, MB(2));
>> +    g_assert_nonnull(node);
>> +
>> +    node = rbcache_search(data->cache, MB(2), MB(3));
>> +    g_assert_null(node);
>> +}
>> +
>> +static void test_rbcache_shrink_fifo(TestRBCacheData *data, const void *ctx)
>> +{
>> +    RBCacheNode *node;
>> +
>> +    rbcache_search_and_insert(data->cache, 0, MB(1));
>> +    rbcache_search_and_insert(data->cache, MB(1), MB(1));
>> +    rbcache_search_and_insert(data->cache, MB(2), MB(1));
>> +    rbcache_search_and_insert(data->cache, MB(3), MB(1));
>> +
>> +    node = rbcache_search_and_insert(data->cache, MB(4), MB(1));
>> +    g_assert_nonnull(node);
>> +    node = rbcache_search(data->cache, 0, MB(1));
>> +    g_assert_null(node);
>> +
>> +    node = rbcache_search_and_insert(data->cache, MB(5), MB(1));
>> +    g_assert_nonnull(node);
>> +    node = rbcache_search(data->cache, MB(1), MB(1));
>> +    g_assert_null(node);
>> +
>> +    node = rbcache_search_and_insert(data->cache, MB(6), MB(1));
>> +    g_assert_nonnull(node);
>> +    node = rbcache_search(data->cache, MB(2), MB(1));
>> +    g_assert_null(node);
>> +
>> +    node = rbcache_search_and_insert(data->cache, MB(7), MB(1));
>> +    g_assert_nonnull(node);
>> +    node = rbcache_search(data->cache, MB(3), MB(1));
>> +    g_assert_null(node);
>
> This test doesn't really show that this is any different from LRU mode.
> LRU would behave exactly the same because you have no accesses to
> existing nodes.
>
> It also doesn't test that the other nodes still exist.
>
>> +}
>> +
>> +static void test_rbcache_shrink_lru(TestRBCacheData *data, const void *ctx)
>> +{
>> +    RBCacheNode *node;
>> +
>> +    rbcache_search_and_insert(data->cache, 0, MB(1));
>> +    rbcache_search_and_insert(data->cache, MB(1), MB(1));
>> +    rbcache_search_and_insert(data->cache, MB(2), MB(1));
>> +    rbcache_search_and_insert(data->cache, MB(3), MB(1));
>> +
>> +    node = rbcache_search(data->cache, MB(2), MB(1));
>> +    g_assert_nonnull(node);
>> +
>> +    node = rbcache_search(data->cache, MB(1), MB(1));
>> +    g_assert_nonnull(node);
>> +
>> +    node = rbcache_search_and_insert(data->cache, MB(4), MB(1));
>> +    g_assert_nonnull(node);
>> +    node = rbcache_search(data->cache, 0, MB(1));
>> +    g_assert_null(node);
>> +
>> +    node = rbcache_search_and_insert(data->cache, MB(5), MB(1));
>> +    g_assert_nonnull(node);
>> +    node = rbcache_search(data->cache, MB(3), MB(1));
>> +    g_assert_null(node);
>> +
>> +    node = rbcache_search_and_insert(data->cache, MB(6), MB(1));
>> +    g_assert_nonnull(node);
>> +    node = rbcache_search(data->cache, MB(2), MB(1));
>> +    g_assert_null(node);
>> +
>> +    node = rbcache_search_and_insert(data->cache, MB(7), MB(1));
>> +    g_assert_nonnull(node);
>> +    node = rbcache_search(data->cache, MB(1), MB(1));
>> +    g_assert_null(node);
>
> Here as well we don't test whether the other nodes still exist, but at
> least we have rbcache_search() calls to make the difference clear. You
> should probably copy those calls to the FIFO test to show that they
> don't have any effect on the order.

Yes, good point.
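
Something like this could work then (an untested sketch; it assumes
rbcache_search() is a pure lookup in FIFO mode, so the accesses must
not change the eviction order):

static void test_rbcache_shrink_fifo(TestRBCacheData *data, const void *ctx)
{
    RBCacheNode *node;

    rbcache_search_and_insert(data->cache, 0, MB(1));
    rbcache_search_and_insert(data->cache, MB(1), MB(1));
    rbcache_search_and_insert(data->cache, MB(2), MB(1));
    rbcache_search_and_insert(data->cache, MB(3), MB(1));

    /* The same accesses as in the LRU test; with FIFO they must not
     * affect the eviction order. */
    node = rbcache_search(data->cache, MB(2), MB(1));
    g_assert_nonnull(node);
    node = rbcache_search(data->cache, MB(1), MB(1));
    g_assert_nonnull(node);

    /* Exceeding the 4MB limit must evict the oldest node (offset 0),
     * even though other nodes were accessed more recently. */
    node = rbcache_search_and_insert(data->cache, MB(4), MB(1));
    g_assert_nonnull(node);
    node = rbcache_search(data->cache, 0, MB(1));
    g_assert_null(node);

    /* The next eviction is the accessed node at MB(1), i.e. pure
     * insertion order. */
    node = rbcache_search_and_insert(data->cache, MB(5), MB(1));
    g_assert_nonnull(node);
    node = rbcache_search(data->cache, MB(1), MB(1));
    g_assert_null(node);

    /* All remaining nodes must still be present. */
    g_assert_nonnull(rbcache_search(data->cache, MB(2), MB(1)));
    g_assert_nonnull(rbcache_search(data->cache, MB(3), MB(1)));
    g_assert_nonnull(rbcache_search(data->cache, MB(4), MB(1)));
    g_assert_nonnull(rbcache_search(data->cache, MB(5), MB(1)));
}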

>> +}
>> +
>> +static void test_rbcache_setup(TestRBCacheData *data, const void *ctx)
>> +{
>> +    const TestRBCacheConfig *config = ctx;
>> +
>> +    data->cache =
>> +        rbcache_create(config->alloc, config->free, config->limit_size,
>> +                       config->eviction_type, config->opaque);
>> +}
>> +
>> +static void test_rbcache_teardown(TestRBCacheData *data, const void *ctx)
>> +{
>> +    rbcache_destroy(data->cache);
>> +}
>> +
>> +static void rbcache_test_add(const char *testpath,
>> +                             void (*test_func)(TestRBCacheData *data,
>> +                             const void *user_data), void *ctx)
>> +{
>> +    g_test_add(testpath, TestRBCacheData, ctx, test_rbcache_setup, test_func,
>> +               test_rbcache_teardown);
>> +}
>> +
>> +int main(int argc, char **argv)
>> +{
>> +    TestRBCacheConfig config = {
>> +        .limit_size = MB(4),
>> +        .eviction_type = RBCACHE_FIFO,
>> +    };
>> +
>> +    g_test_init(&argc, &argv, NULL);
>> +
>> +    rbcache_test_add("/rbcache/init", test_rbcache_init, &config);
>> +    rbcache_test_add("/rbcache/insert", test_rbcache_insert, &config);
>> +    rbcache_test_add("/rbcache/search", test_rbcache_search, &config);
>> +    rbcache_test_add("/rbcache/search_and_insert",
>> +                     test_rbcache_search_and_insert, &config);
>> +    rbcache_test_add("/rbcache/rbcache_remove", test_rbcache_remove, &config);
>> +    rbcache_test_add("/rbcache/shrink", test_rbcache_shrink, &config);
>> +    rbcache_test_add("/rbcache/shrink/fifo", test_rbcache_shrink_fifo, &config);
>> +
>> +    config.eviction_type = RBCACHE_LRU;
>> +    rbcache_test_add("/rbcache/shrink/lru", test_rbcache_shrink_lru, &config);
>> +
>> +    return g_test_run();
>> +}
>
> Kevin
>


* Re: [Qemu-devel] [PATCH v1 05/18] tests/test-rbcache: add test cases
  2016-11-25  9:58     ` Pavel Butsykin
@ 2016-11-25 10:11       ` Kevin Wolf
  0 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2016-11-25 10:11 UTC (permalink / raw)
  To: Pavel Butsykin
  Cc: qemu-devel, qemu-block, den, famz, stefanha, mreitz, eblake

On 25.11.2016 at 10:58, Pavel Butsykin wrote:
> On 24.11.2016 15:20, Kevin Wolf wrote:
> >Visualised, we test these requests:
> >
> >1: *
> >2:  **
> >3:                *
> >4:        *******
> >5:   ********
> >
> >You test inserting the only element, inserting after the last element,
> >inserting in the middle and inserting something that overlaps two other
> >requests at its start and end.
> >
> >That's a good start, but it might be worth testing more scenarios:
> >
> >- Inserting a new first element to a non-empty cache
> 
> Do you mean inserting an element at offset zero when the cache
> already contains other nodes?

Yes, that would be one way to do it. Maybe just swap requests 1 and 2.
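
Roughly like this, as a sketch (the offsets are made up, and
rbcache_search_and_insert() stands in for whatever the insert test
actually calls):

    /* request 2 goes in first, so the cache is already non-empty... */
    node = rbcache_search_and_insert(data->cache, MB(1), MB(1));
    g_assert_nonnull(node);

    /* ...and request 1 then becomes a new first element */
    node = rbcache_search_and_insert(data->cache, 0, MB(1));
    g_assert_nonnull(node);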

> >- Overlapping only at the start
> >- Overlapping only at the end
> >- Overlapping in the middle (i.e. including existing ranges as a
> >   subset)
> >     * With only one node
> >     * With multiple nodes (like adding offset=2, size=16kb here)
> >
> 
> Ok.
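
Good. For the overlap cases I'm thinking of something along these
lines (only a sketch; the numbers may need tweaking to stay below
limit_size, and what rbcache_search_and_insert() returns for partial
overlaps needs to be pinned down first, so the assertions are only
indicative):

    /* one existing node in the middle: [MB(2), MB(4)) */
    rbcache_search_and_insert(data->cache, MB(2), MB(2));

    /* a request that overlaps only the start of the existing node */
    node = rbcache_search_and_insert(data->cache, MB(1), MB(2));
    g_assert_nonnull(node);

    /* a request that overlaps only the end of the existing node */
    node = rbcache_search_and_insert(data->cache, MB(3), MB(2));
    g_assert_nonnull(node);

    /* a request that includes an existing node as a subset; a second
     * variant should span multiple nodes at once */
    node = rbcache_search_and_insert(data->cache, MB(1), MB(4));
    g_assert_nonnull(node);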

Kevin

