* [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

The prefetch cache aims to improve the performance of sequential reads.
Of most interest here are small sequential read requests: such requests can
be optimized by extending them and moving the extra data into the prefetch
cache. However, there are two issues:
 - In aggregate, only a small portion of requests is sequential, so delays
   caused by reading larger volumes of data can lead to an overall decrease
   in performance.
 - With a large number of random requests, the cache memory fills up with
   redundant data.
This pcache implementation solves the above and other problems of data
prefetching. The pcache algorithm can be summarised by the following main
steps.

1. Monitor I/O requests to identify typical sequences.
This prefetch cache implementation works at the storage-system level and has
information only about the physical block addresses of I/O requests.
Statistics are collected only from read requests up to a maximum size of
64 KiB (by default); each request that matches the criteria falls into a
pool of requests. An rb-tree is used to store the request statistics: a
simple but, for this task, quite efficient data structure.
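
As an illustration (not the actual pcache code), the statistics step can be
sketched against the RBCache API introduced later in this series; the
function name and the way the pool is created are assumptions:

    #include "qemu/rbcache.h"

    #define PCACHE_MAX_STAT_REQ_SIZE (64 * 1024) /* default filter limit */

    /* req_stats would be created once, e.g.:
     * RBCache *req_stats = rbcache_create(NULL, NULL, stats_limit,
     *                                     RBCACHE_FIFO, NULL);
     */
    static void pcache_update_statistics(RBCache *req_stats,
                                         uint64_t offset, uint64_t bytes)
    {
        if (bytes > PCACHE_MAX_STAT_REQ_SIZE) {
            return; /* large reads are not tracked */
        }
        /* Record the byte range; an overlapping range returns the
         * existing node instead of inserting a duplicate. */
        rbcache_search_and_insert(req_stats, offset, bytes);
    }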

2. Identify sequential I/O streams.
For each read request, an attempt is made to lift a chain of sequential
requests from the statistics tree such that the current request would be an
element of that chain. The key used to search for consecutive requests is
the area of bytes immediately preceding the current request. This area
should not be too small, to avoid false readahead. Sequential data streams
can be identified even among a large number of random requests. For example,
if blocks 100, 1157, 27520, 4, 101, 312, 1337, 102 are accessed, then while
processing request 102 the sequential chain 100, 101, 102 is identified, and
a decision can then be made to do readahead. A situation may also arise
where multiple applications A, B, C simultaneously perform sequential reads.
Each application by itself reads data sequentially: A(100, 101, 102),
B(300, 301, 302), C(700, 701, 702), but to the block device this may look
like random reading: 100, 300, 700, 101, 301, 701, 102, 302, 702. In this
case the sequential streams are still recognised, because the placement of
the requests in the rb-tree makes it possible to separate the sequential
I/O streams.
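
A sketch of the detection step (again illustrative; the window size and the
names are assumptions, and real detection would check that the preceding
area is contiguously covered rather than merely overlapped):

    #define SEQ_CHECK_WINDOW (4 * 1024) /* hypothetical lookbehind size */

    static bool pcache_request_is_sequential(RBCache *req_stats,
                                             uint64_t offset, uint64_t bytes)
    {
        uint64_t lookbehind = MIN(offset, (uint64_t)SEQ_CHECK_WINDOW);

        if (lookbehind == 0) {
            return false; /* nothing can precede offset 0 */
        }
        /* Did earlier small reads land directly before this request? */
        return rbcache_search(req_stats, offset - lookbehind,
                              lookbehind) != NULL;
    }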

3. Do readahead into the cache for recognised sequential data streams.
Once a sequential stream has been detected, pcache needs larger requests to
bring data into the cache. This implementation uses readahead instead of
extending the original request, so the guest's request goes through as-is.
There is no reason to put data into the cache that will never be picked up,
but that would always happen if requests were extended. An rb-tree is also
used to store the areas of cached blocks: again, a simple but, for this
task, quite efficient data structure.
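
A sketch of the readahead decision (illustrative; the series itself
implements this with AIO requests and inflight-request tracking):

    static void pcache_try_readahead(BlockDriverState *bs, RBCache *cache,
                                     uint64_t offset, uint64_t bytes,
                                     uint64_t readahead_size)
    {
        uint64_t ra_offset = offset + bytes;

        /* Skip if the area ahead is already (at least partially) cached. */
        if (rbcache_search(cache, ra_offset, readahead_size)) {
            return;
        }
        rbcache_search_and_insert(cache, ra_offset, readahead_size);
        /* ...then read [ra_offset, ra_offset + readahead_size) from
         * bs->file into the new node's buffer, asynchronously, while the
         * guest's original request proceeds unchanged... */
    }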

4. Control the size of the prefetch cache pool and the request statistics pool.
To bound the statistics pool, request records are placed and replaced
according to the FIFO principle; this part is simple. To bound the cache
memory, an LRU list is used, which limits the maximum amount of memory that
pcache can allocate. The LRU is there mainly to prevent eviction of cache
blocks that have only been read partially. The main eviction path frees
memory immediately after use: as soon as a chunk of cache memory has been
completely read, it is dropped, since the probability of the same request
being repeated is very low. Cases where one and the same portion of cache
memory is read several times are not optimized; they do not belong to the
cases that pcache can optimize. Thus, with a small amount of cache memory,
by optimizing the readahead and memory-release operations, we can read
entire volumes of data with a 100% cache hit rate, without decreasing the
performance of random read requests.
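
The "drop immediately after full use" policy can be sketched as follows
(the node layout and counter name are assumptions; embedding RBCacheNode
follows the extension pattern the rbcache alloc/free callbacks are designed
for):

    typedef struct PCacheNodeSketch {
        RBCacheNode common;   /* range: common.offset, common.bytes */
        uint64_t rdcnt;       /* bytes already copied out to guest reads */
    } PCacheNodeSketch;

    static void pcache_account_read(RBCache *cache, PCacheNodeSketch *node,
                                    uint64_t bytes_copied)
    {
        node->rdcnt += bytes_copied;
        if (node->rdcnt >= node->common.bytes) {
            /* Fully consumed: free it now rather than waiting for the
             * LRU limit, since a repeat read of this area is unlikely. */
            rbcache_remove(cache, &node->common);
        }
    }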

PCache is implemented as a QEMU block filter driver and has several
configurable parameters: total cache size, statistics pool size, readahead
size, and the maximum size of a request that can be processed.

For performance evaluation, several test cases with different sequential and
random read patterns were run against an SSD disk. Here are the test results
and the qemu parameters used:

qemu parameters: 
-machine pc,accel=kvm,usb=off,vmport=off -m 1024 -smp 8
-drive file=/img/harddisk.hdd,if=none,cache=none,id=drive-scsi0-0-0-0,aio=native
-drive driver=pcache,image=drive-scsi0-0-0-0,if=virtio

*****************************************************************
* Testcase                        * Results in req/s            *
*                                 *******************************
*                                 *  clean qemu  *    pcache    *
*****************************************************************
* Create/open 16 file(s) of total * 21645 req/s  * 74793 req/s  *
* size 2048.00 MB named           * 20385 req/s  * 66481 req/s  *
* /tmp/tmp.tmp, start 4 thread(s) * 20616 req/s  * 69007 req/s  *
* and do uncached sequential read *              *              *
* by 4KB blocks                   *              *              *
*****************************************************************
* Create/open 16 file(s) of total * 84033 req/s  * 87828 req/s  *
* size 2048.00 MB named           * 84602 req/s  * 89678 req/s  *
* /tmp/tmp.tmp, start 4 thread(s) * 83163 req/s  * 96202 req/s  *
* and do uncached sequential read *              *              *
* by 4KB blocks with constant     *              *              *
* queue len 32                    *              *              *
*****************************************************************
* Create/open 16 file(s) of total * 14104 req/s  * 14164 req/s  *
* size 2048.00 MB named           * 14130 req/s  * 14232 req/s  *
* /tmp/tmp.tmp, start 4 thread(s) * 14183 req/s  * 14080 req/s  *
* and do uncached random read by  *              *              *
* 4KB blocks                      *              *              *
*****************************************************************
* Create/open 16 file(s) of total * 23480 req/s  * 23483 req/s  *
* size 2048.00 MB named           * 23070 req/s  * 22432 req/s  *
* /tmp/tmp.tmp, start 4 thread(s) * 24090 req/s  * 23499 req/s  *
* and do uncached random read by  *              *              *
* 4KB blocks with constant queue  *              *              *
* len 32                          *              *              *
*****************************************************************

Changes from v1:
- avoid bdrv_aio_*() interfaces
- add pcache to the QAPI schema
- fix remarks and add more comments for rbcache
- add more scenarios for "/rbcache/insert" test
- fix rbcache/shrink/* tests
- pcache: up-to-date cache for removed nodes
- rewrite "block/pcache: pick up parts of the cache" patch
- changed the statuses of nodes for a more flexible determination of
  the node state

Pavel Butsykin (18):
  block/pcache: empty pcache driver filter
  util/rbtree: add rbtree from linux kernel
  util/rbcache: range-based cache core
  tests/test-rbcache: add test cases
  block/pcache: statistics collection read requests
  block/pcache: skip large aio read
  block/pcache: updating statistics for overlapping requests
  block/pcache: add AIO readahead
  block/pcache: skip readahead for unallocated clusters
  block/pcache: cache invalidation on write requests
  block/pcache: add reading data from the cache
  block/pcache: inflight readahead request waiting for read
  block/pcache: write through
  block/pcache: up-to-date cache for removed nodes
  block/pcache: pick up parts of the cache
  block/pcache: drop used pcache nodes
  qapi: allow blockdev-add for pcache
  block/pcache: add tracepoints

 MAINTAINERS                     |  13 +
 block/Makefile.objs             |   1 +
 block/pcache.c                  | 764 ++++++++++++++++++++++++++++++++++++++++
 block/trace-events              |  10 +
 include/qemu/rbcache.h          | 128 +++++++
 include/qemu/rbtree.h           | 107 ++++++
 include/qemu/rbtree_augmented.h | 235 ++++++++++++
 qapi/block-core.json            |  30 +-
 tests/Makefile.include          |   3 +
 tests/test-rbcache.c            | 431 +++++++++++++++++++++++
 util/Makefile.objs              |   2 +
 util/rbcache.c                  | 253 +++++++++++++
 util/rbtree.c                   | 570 ++++++++++++++++++++++++++++++
 13 files changed, 2545 insertions(+), 2 deletions(-)
 create mode 100644 block/pcache.c
 create mode 100644 include/qemu/rbcache.h
 create mode 100644 include/qemu/rbtree.h
 create mode 100644 include/qemu/rbtree_augmented.h
 create mode 100644 tests/test-rbcache.c
 create mode 100644 util/rbcache.c
 create mode 100644 util/rbtree.c

-- 
2.11.0


* [Qemu-devel] [PATCH v2 01/18] block/pcache: empty pcache driver filter
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

A basic version of the pcache driver filter, to simplify review of the
patch set.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/Makefile.objs |   1 +
 block/pcache.c      | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 103 insertions(+)
 create mode 100644 block/pcache.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 67a036a1df..0a024f4e66 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -4,6 +4,7 @@ block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-obj-y += qed-check.o
 block-obj-y += vhdx.o vhdx-endian.o vhdx-log.o
 block-obj-y += quorum.o
+block-obj-y += pcache.o
 block-obj-y += parallels.o blkdebug.o blkverify.o blkreplay.o
 block-obj-y += block-backend.o snapshot.o qapi.o
 block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
diff --git a/block/pcache.c b/block/pcache.c
new file mode 100644
index 0000000000..7a67618408
--- /dev/null
+++ b/block/pcache.c
@@ -0,0 +1,102 @@
+/*
+ * Prefetch cache driver filter
+ *
+ * Copyright (C) 2015-2016 Parallels IP Holdings GmbH.
+ *
+ * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "block/block_int.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qstring.h"
+
+
+static QemuOptsList runtime_opts = {
+    .name = "pcache",
+    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
+    .desc = {
+        { /* end of list */ }
+    },
+};
+
+static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
+                                         uint64_t bytes, QEMUIOVector *qiov,
+                                         int flags)
+{
+    return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
+}
+
+static coroutine_fn int pcache_co_pwritev(BlockDriverState *bs, uint64_t offset,
+                                          uint64_t bytes, QEMUIOVector *qiov,
+                                          int flags)
+{
+    return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
+}
+
+static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
+                            Error **errp)
+{
+    QemuOpts *opts;
+    Error *local_err = NULL;
+    int ret = 0;
+
+    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    assert(bs->file == NULL);
+    bs->file = bdrv_open_child(NULL, options, "image", bs, &child_format,
+                               false, &local_err);
+    if (local_err) {
+        ret = -EINVAL;
+        error_propagate(errp, local_err);
+    }
+fail:
+    qemu_opts_del(opts);
+    return ret;
+}
+
+static void pcache_close(BlockDriverState *bs)
+{
+}
+
+static int64_t pcache_getlength(BlockDriverState *bs)
+{
+    return bdrv_getlength(bs->file->bs);
+}
+
+static bool pcache_recurse_is_first_non_filter(BlockDriverState *bs,
+                                               BlockDriverState *candidate)
+{
+    return bdrv_recurse_is_first_non_filter(bs->file->bs, candidate);
+}
+
+static BlockDriver bdrv_pcache = {
+    .format_name                        = "pcache",
+    .instance_size                      = 0,
+
+    .bdrv_file_open                     = pcache_file_open,
+    .bdrv_close                         = pcache_close,
+    .bdrv_getlength                     = pcache_getlength,
+
+    .bdrv_co_preadv                     = pcache_co_preadv,
+    .bdrv_co_pwritev                    = pcache_co_pwritev,
+
+    .is_filter                          = true,
+    .bdrv_recurse_is_first_non_filter   = pcache_recurse_is_first_non_filter,
+};
+
+static void bdrv_cache_init(void)
+{
+    bdrv_register(&bdrv_pcache);
+}
+
+block_init(bdrv_cache_init);
-- 
2.11.0


* [Qemu-devel] [PATCH v2 02/18] util/rbtree: add rbtree from linux kernel
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

Why don't we use the rbtree from glib? We need a pointer to the parent node.
To store cached chunks in the rbtree optimally, we need to fetch the next
and previous nodes, and the parent pointer in each node makes an efficient
implementation of these functions possible. In this rbtree implementation
(unlike glib's), a node contains a pointer to its parent node. Moreover,
this rbtree gives an algorithm more flexibility, because to use it you have
to implement your own insert and search cores. This avoids callbacks and the
dramatic performance drop they would cause.
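
As a usage example (a sketch; MyNode and its key field are illustrative),
the classic insert core with this API looks like:

    struct MyNode {
        struct RbNode rb;
        uint64_t key;
    };

    static void my_tree_insert(struct RbRoot *root, struct MyNode *new)
    {
        struct RbNode **link = &root->rb_node, *parent = NULL;

        while (*link) {
            struct MyNode *cur = container_of(*link, struct MyNode, rb);

            parent = *link;
            if (new->key < cur->key) {
                link = &(*link)->rb_left;
            } else {
                link = &(*link)->rb_right;
            }
        }
        /* The comparison above is inlined at the call site -- no callbacks. */
        rb_link_node(&new->rb, parent, link);
        rb_insert_color(&new->rb, root);
    }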

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 MAINTAINERS                     |   7 +
 include/qemu/rbtree.h           | 107 ++++++++
 include/qemu/rbtree_augmented.h | 235 +++++++++++++++++
 util/Makefile.objs              |   1 +
 util/rbtree.c                   | 570 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 920 insertions(+)
 create mode 100644 include/qemu/rbtree.h
 create mode 100644 include/qemu/rbtree_augmented.h
 create mode 100644 util/rbtree.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 585cd5abd7..228278c1ca 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1465,6 +1465,13 @@ F: include/qemu/throttle.h
 F: util/throttle.c
 L: qemu-block@nongnu.org
 
+Red Black Trees
+M: Denis V. Lunev <den@openvz.org>
+S: Supported
+F: include/qemu/rbtree.h
+F: include/qemu/rbtree_augmented.h
+F: util/rbtree.c
+
 UUID
 M: Fam Zheng <famz@redhat.com>
 S: Supported
diff --git a/include/qemu/rbtree.h b/include/qemu/rbtree.h
new file mode 100644
index 0000000000..d2e3fdd149
--- /dev/null
+++ b/include/qemu/rbtree.h
@@ -0,0 +1,107 @@
+/*
+  Red Black Trees
+  (C) 1999  Andrea Arcangeli <andrea@suse.de>
+
+  This program is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program; if not, write to the Free Software
+  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+  include/qemu/rbtree.h
+
+  To use rbtrees you'll have to implement your own insert and search cores.
+  This will avoid us to use callbacks and to drop drammatically performances.
+  I know it's not the cleaner way,  but in C (not in C++) to get
+  performances and genericity...
+*/
+
+#ifndef QEMU_RBTREE_H
+#define QEMU_RBTREE_H
+
+#include <unistd.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+struct RbNode {
+    uintptr_t __rb_parent_color;
+    struct RbNode *rb_right;
+    struct RbNode *rb_left;
+} __attribute__((aligned(sizeof(uintptr_t))));
+    /* The alignment might seem pointless, but allegedly CRIS needs it */
+
+struct RbRoot {
+    struct RbNode *rb_node;
+};
+
+
+#define RB_PARENT(r) ((struct RbNode *)((r)->__rb_parent_color & ~3))
+
+#define RB_ROOT (struct RbRoot) { NULL, }
+#define RB_ENTRY(ptr, type, member) container_of(ptr, type, member)
+
+#define RB_EMPTY_ROOT(root) ((root)->rb_node == NULL)
+
+/* 'empty' nodes are nodes that are known not to be inserted in an rbtree */
+#define RB_EMPTY_NODE(node)  \
+    ((node)->__rb_parent_color == (uintptr_t)(node))
+#define RB_CLEAR_NODE(node)  \
+    ((node)->__rb_parent_color = (uintptr_t)(node))
+
+
+extern void rb_insert_color(struct RbNode *, struct RbRoot *);
+extern void rb_erase(struct RbNode *, struct RbRoot *);
+
+
+/* Find logical next and previous nodes in a tree */
+extern struct RbNode *rb_next(const struct RbNode *);
+extern struct RbNode *rb_prev(const struct RbNode *);
+extern struct RbNode *rb_first(const struct RbRoot *);
+extern struct RbNode *rb_last(const struct RbRoot *);
+
+/* Postorder iteration - always visit the parent after its children */
+extern struct RbNode *rb_first_postorder(const struct RbRoot *);
+extern struct RbNode *rb_next_postorder(const struct RbNode *);
+
+/* Fast replacement of a single node without remove/rebalance/add/rebalance */
+extern void rb_replace_node(struct RbNode *victim, struct RbNode *new,
+                            struct RbRoot *root);
+
+static inline void rb_link_node(struct RbNode *node, struct RbNode *parent,
+                                struct RbNode **rb_link)
+{
+    node->__rb_parent_color = (uintptr_t)parent;
+    node->rb_left = node->rb_right = NULL;
+
+    *rb_link = node;
+}
+
+#define RB_ENTRY_SAFE(ptr, type, member)                 \
+    ({ typeof(ptr) ____ptr = (ptr);                      \
+       ____ptr ? rb_entry(____ptr, type, member) : NULL; \
+    })
+
+/**
+ * rbtree_postorder_for_each_entry_safe - iterate over rb_root in post order of
+ * given type safe against removal of rb_node entry
+ *
+ * @pos:   the 'type *' to use as a loop cursor.
+ * @n:     another 'type *' to use as temporary storage
+ * @root:  'rb_root *' of the rbtree.
+ * @field: the name of the rb_node field within 'type'.
+ */
+#define RBTREE_POSTORDER_FOR_EACH_ENTRY_SAFE(pos, n, root, field)            \
+    for (pos = rb_entry_safe(rb_first_postorder(root), typeof(*pos), field); \
+         pos && ({ n = rb_entry_safe(rb_next_postorder(&pos->field),         \
+         typeof(*pos), field); 1; });                                        \
+         pos = n)
+
+#endif  /* QEMU_RBTREE_H */
diff --git a/include/qemu/rbtree_augmented.h b/include/qemu/rbtree_augmented.h
new file mode 100644
index 0000000000..f218d31e96
--- /dev/null
+++ b/include/qemu/rbtree_augmented.h
@@ -0,0 +1,235 @@
+/*
+  Red Black Trees
+  (C) 1999  Andrea Arcangeli <andrea@suse.de>
+  (C) 2002  David Woodhouse <dwmw2@infradead.org>
+  (C) 2012  Michel Lespinasse <walken@google.com>
+
+  This program is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program; if not, write to the Free Software
+  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+  include/qemu/rbtree_augmented.h
+*/
+
+#ifndef QEMU_RBTREE_AUGMENTED_H
+#define QEMU_RBTREE_AUGMENTED_H
+
+#include "qemu/compiler.h"
+#include "qemu/rbtree.h"
+
+/*
+ * Please note - only struct RbAugmentCallbacks and the prototypes for
+ * rb_insert_augmented() and rb_erase_augmented() are intended to be public.
+ * The rest are implementation details you are not expected to depend on.
+ */
+
+struct RbAugmentCallbacks {
+    void (*propagate)(struct RbNode *node, struct RbNode *stop);
+    void (*copy)(struct RbNode *old, struct RbNode *new);
+    void (*rotate)(struct RbNode *old, struct RbNode *new);
+};
+
+extern void __rb_insert_augmented(struct RbNode *node, struct RbRoot *root,
+    void (*augment_rotate)(struct RbNode *old, struct RbNode *new));
+/*
+ * Fixup the rbtree and update the augmented information when rebalancing.
+ *
+ * On insertion, the user must update the augmented information on the path
+ * leading to the inserted node, then call rb_link_node() as usual and
+ * rb_augment_inserted() instead of the usual rb_insert_color() call.
+ * If rb_augment_inserted() rebalances the rbtree, it will callback into
+ * a user provided function to update the augmented information on the
+ * affected subtrees.
+ */
+static inline void
+rb_insert_augmented(struct RbNode *node, struct RbRoot *root,
+                    const struct RbAugmentCallbacks *augment)
+{
+    __rb_insert_augmented(node, root, augment->rotate);
+}
+
+#define RB_DECLARE_CALLBACKS(rbstatic, rbname, rbstruct, rbfield, \
+                             rbtype, rbaugmented, rbcompute)      \
+static inline void                                                \
+rbname ## _propagate(struct RbNode *rb, struct RbNode *stop)      \
+{                                                                 \
+    while (rb != stop) {                                          \
+        rbstruct *node = rb_entry(rb, rbstruct, rbfield);         \
+        rbtype augmented = rbcompute(node);                       \
+        if (node->rbaugmented == augmented) {                     \
+            break;                                                \
+        }                                                         \
+        node->rbaugmented = augmented;                            \
+        rb = rb_parent(&node->rbfield);                           \
+    }                                                             \
+}                                                                 \
+static inline void                                                \
+rbname ## _copy(struct RbNode *rb_old, struct RbNode *rb_new)     \
+{                                                                 \
+    rbstruct *old = rb_entry(rb_old, rbstruct, rbfield);          \
+    rbstruct *new = rb_entry(rb_new, rbstruct, rbfield);          \
+    new->rbaugmented = old->rbaugmented;                          \
+}                                                                 \
+static void                                                       \
+rbname ## _rotate(struct RbNode *rb_old, struct RbNode *rb_new)   \
+{                                                                 \
+    rbstruct *old = rb_entry(rb_old, rbstruct, rbfield);          \
+    rbstruct *new = rb_entry(rb_new, rbstruct, rbfield);          \
+    new->rbaugmented = old->rbaugmented;                          \
+    old->rbaugmented = rbcompute(old);                            \
+}                                                                 \
+rbstatic const struct RbAugmentCallbacks rbname = {               \
+    rbname ## _propagate, rbname ## _copy, rbname ## _rotate      \
+};
+
+
+#define RB_RED   0
+#define RB_BLACK 1
+
+#define __RB_PARENT(pc)    ((struct RbNode *)(pc & ~3))
+
+#define __RB_COLOR(pc)     ((pc) & 1)
+#define __RB_IS_BLACK(pc)  __RB_COLOR(pc)
+#define __RB_IS_RED(pc)    (!__RB_COLOR(pc))
+#define RB_COLOR(rb)       __RB_COLOR((rb)->__rb_parent_color)
+#define RB_IS_RED(rb)      __RB_IS_RED((rb)->__rb_parent_color)
+#define RB_IS_BLACK(rb)    __RB_IS_BLACK((rb)->__rb_parent_color)
+
+static inline void rb_set_parent(struct RbNode *rb, struct RbNode *p)
+{
+    rb->__rb_parent_color = RB_COLOR(rb) | (uintptr_t)p;
+}
+
+static inline void rb_set_parent_color(struct RbNode *rb,
+                                       struct RbNode *p, int color)
+{
+    rb->__rb_parent_color = (uintptr_t)p | color;
+}
+
+static inline void
+__rb_change_child(struct RbNode *old, struct RbNode *new,
+                  struct RbNode *parent, struct RbRoot *root)
+{
+    if (parent) {
+        if (parent->rb_left == old) {
+            parent->rb_left = new;
+        } else {
+            parent->rb_right = new;
+        }
+    } else {
+        root->rb_node = new;
+    }
+}
+
+extern void __rb_erase_color(struct RbNode *parent, struct RbRoot *root,
+    void (*augment_rotate)(struct RbNode *old, struct RbNode *new));
+
+static inline struct RbNode *
+__rb_erase_augmented(struct RbNode *node, struct RbRoot *root,
+                     const struct RbAugmentCallbacks *augment)
+{
+    struct RbNode *child = node->rb_right, *tmp = node->rb_left;
+    struct RbNode *parent, *rebalance;
+    uintptr_t pc;
+
+    if (!tmp) {
+        /*
+         * Case 1: node to erase has no more than 1 child (easy!)
+         *
+         * Note that if there is one child it must be red due to 5)
+         * and node must be black due to 4). We adjust colors locally
+         * so as to bypass __rb_erase_color() later on.
+         */
+        pc = node->__rb_parent_color;
+        parent = __RB_PARENT(pc);
+        __rb_change_child(node, child, parent, root);
+        if (child) {
+            child->__rb_parent_color = pc;
+            rebalance = NULL;
+        } else {
+            rebalance = __RB_IS_BLACK(pc) ? parent : NULL;
+        }
+        tmp = parent;
+    } else if (!child) {
+        /* Still case 1, but this time the child is node->rb_left */
+        tmp->__rb_parent_color = pc = node->__rb_parent_color;
+        parent = __RB_PARENT(pc);
+        __rb_change_child(node, tmp, parent, root);
+        rebalance = NULL;
+        tmp = parent;
+    } else {
+        struct RbNode *successor = child, *child2;
+        tmp = child->rb_left;
+        if (!tmp) {
+            /*
+             * Case 2: node's successor is its right child
+             *
+             *    (n)          (s)
+             *    / \          / \
+             *  (x) (s)  ->  (x) (c)
+             *        \
+             *        (c)
+             */
+            parent = successor;
+            child2 = successor->rb_right;
+            augment->copy(node, successor);
+        } else {
+            /*
+             * Case 3: node's successor is leftmost under
+             * node's right child subtree
+             *
+             *    (n)          (s)
+             *    / \          / \
+             *  (x) (y)  ->  (x) (y)
+             *      /            /
+             *    (p)          (p)
+             *    /            /
+             *  (s)          (c)
+             *    \
+             *    (c)
+             */
+            do {
+                parent = successor;
+                successor = tmp;
+                tmp = tmp->rb_left;
+            } while (tmp);
+            parent->rb_left = child2 = successor->rb_right;
+            successor->rb_right = child;
+            rb_set_parent(child, successor);
+            augment->copy(node, successor);
+            augment->propagate(parent, successor);
+        }
+
+        successor->rb_left = tmp = node->rb_left;
+        rb_set_parent(tmp, successor);
+
+        pc = node->__rb_parent_color;
+        tmp = __RB_PARENT(pc);
+        __rb_change_child(node, successor, tmp, root);
+        if (child2) {
+            successor->__rb_parent_color = pc;
+            rb_set_parent_color(child2, parent, RB_BLACK);
+            rebalance = NULL;
+        } else {
+            unsigned long pc2 = successor->__rb_parent_color;
+            successor->__rb_parent_color = pc;
+            rebalance = __RB_IS_BLACK(pc2) ? parent : NULL;
+        }
+        tmp = successor;
+    }
+
+    augment->propagate(tmp, NULL);
+    return rebalance;
+}
+
+#endif /* QEMU_RBTREE_AUGMENTED_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index ad0f9c7fe4..a5607cb88f 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -36,3 +36,4 @@ util-obj-y += log.o
 util-obj-y += qdist.o
 util-obj-y += qht.o
 util-obj-y += range.o
+util-obj-y += rbtree.o
diff --git a/util/rbtree.c b/util/rbtree.c
new file mode 100644
index 0000000000..704dcea60a
--- /dev/null
+++ b/util/rbtree.c
@@ -0,0 +1,570 @@
+/*
+  Red Black Trees
+  (C) 1999  Andrea Arcangeli <andrea@suse.de>
+  (C) 2002  David Woodhouse <dwmw2@infradead.org>
+  (C) 2012  Michel Lespinasse <walken@google.com>
+
+  This program is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program; if not, write to the Free Software
+  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+*/
+
+#include "qemu/rbtree_augmented.h"
+
+/*
+ * red-black trees properties:  http://en.wikipedia.org/wiki/Rbtree
+ *
+ *  1) A node is either red or black
+ *  2) The root is black
+ *  3) All leaves (NULL) are black
+ *  4) Both children of every red node are black
+ *  5) Every simple path from root to leaves contains the same number
+ *     of black nodes.
+ *
+ *  4 and 5 give the O(log n) guarantee, since 4 implies you cannot have two
+ *  consecutive red nodes in a path and every red node is therefore followed by
+ *  a black. So if B is the number of black nodes on every simple path (as per
+ *  5), then the longest possible path due to 4 is 2B.
+ *
+ *  We shall indicate color with case, where black nodes are uppercase and red
+ *  nodes will be lowercase. Unknown color nodes shall be drawn as red within
+ *  parentheses and have some accompanying text comment.
+ */
+
+static inline void rb_set_black(struct RbNode *rb)
+{
+    rb->__rb_parent_color |= RB_BLACK;
+}
+
+static inline struct RbNode *rb_red_parent(struct RbNode *red)
+{
+    return (struct RbNode *)red->__rb_parent_color;
+}
+
+/*
+ * Helper function for rotations:
+ * - old's parent and color get assigned to new
+ * - old gets assigned new as a parent and 'color' as a color.
+ */
+static inline void
+__rb_rotate_set_parents(struct RbNode *old, struct RbNode *new,
+                        struct RbRoot *root, int color)
+{
+    struct RbNode *parent = RB_PARENT(old);
+    new->__rb_parent_color = old->__rb_parent_color;
+    rb_set_parent_color(old, new, color);
+    __rb_change_child(old, new, parent, root);
+}
+
+static inline void
+__rb_insert(struct RbNode *node, struct RbRoot *root,
+            void (*augment_rotate)(struct RbNode *old, struct RbNode *new))
+{
+    struct RbNode *parent = rb_red_parent(node), *gparent, *tmp;
+
+    while (true) {
+        /*
+        * Loop invariant: node is red
+        *
+        * If there is a black parent, we are done.
+        * Otherwise, take some corrective action as we don't
+        * want a red root or two consecutive red nodes.
+        */
+        if (!parent) {
+            rb_set_parent_color(node, NULL, RB_BLACK);
+            break;
+        } else if (RB_IS_BLACK(parent)) {
+            break;
+        }
+
+        gparent = rb_red_parent(parent);
+
+        tmp = gparent->rb_right;
+        if (parent != tmp) {    /* parent == gparent->rb_left */
+            if (tmp && RB_IS_RED(tmp)) {
+                /*
+                 * Case 1 - color flips
+                 *
+                 *       G            g
+                 *      / \          / \
+                 *     p   u  -->   P   U
+                 *    /            /
+                 *   n            n
+                 *
+                 * However, since g's parent might be red, and
+                 * 4) does not allow this, we need to recurse
+                 * at g.
+                 */
+                rb_set_parent_color(tmp, gparent, RB_BLACK);
+                rb_set_parent_color(parent, gparent, RB_BLACK);
+                node = gparent;
+                parent = RB_PARENT(node);
+                rb_set_parent_color(node, parent, RB_RED);
+                continue;
+            }
+
+            tmp = parent->rb_right;
+            if (node == tmp) {
+                /*
+                 * Case 2 - left rotate at parent
+                 *
+                 *      G             G
+                 *     / \           / \
+                 *    p   U  -->    n   U
+                 *     \           /
+                 *      n         p
+                 *
+                 * This still leaves us in violation of 4), the
+                 * continuation into Case 3 will fix that.
+                 */
+                parent->rb_right = tmp = node->rb_left;
+                node->rb_left = parent;
+                if (tmp) {
+                    rb_set_parent_color(tmp, parent, RB_BLACK);
+                }
+                rb_set_parent_color(parent, node, RB_RED);
+                augment_rotate(parent, node);
+                parent = node;
+                tmp = node->rb_right;
+            }
+
+            /*
+             * Case 3 - right rotate at gparent
+             *
+             *        G           P
+             *       / \         / \
+             *      p   U  -->  n   g
+             *     /                 \
+             *    n                   U
+             */
+            gparent->rb_left = tmp;  /* == parent->rb_right */
+            parent->rb_right = gparent;
+            if (tmp) {
+                rb_set_parent_color(tmp, gparent, RB_BLACK);
+            }
+            __rb_rotate_set_parents(gparent, parent, root, RB_RED);
+            augment_rotate(gparent, parent);
+            break;
+        } else {
+            tmp = gparent->rb_left;
+            if (tmp && RB_IS_RED(tmp)) {
+                /* Case 1 - color flips */
+                rb_set_parent_color(tmp, gparent, RB_BLACK);
+                rb_set_parent_color(parent, gparent, RB_BLACK);
+                node = gparent;
+                parent = RB_PARENT(node);
+                rb_set_parent_color(node, parent, RB_RED);
+                continue;
+            }
+
+            tmp = parent->rb_left;
+            if (node == tmp) {
+                /* Case 2 - right rotate at parent */
+                parent->rb_left = tmp = node->rb_right;
+                node->rb_right = parent;
+                if (tmp) {
+                    rb_set_parent_color(tmp, parent, RB_BLACK);
+                }
+                rb_set_parent_color(parent, node, RB_RED);
+                augment_rotate(parent, node);
+                parent = node;
+                tmp = node->rb_left;
+            }
+
+            /* Case 3 - left rotate at gparent */
+            gparent->rb_right = tmp;  /* == parent->rb_left */
+            parent->rb_left = gparent;
+            if (tmp) {
+                rb_set_parent_color(tmp, gparent, RB_BLACK);
+            }
+            __rb_rotate_set_parents(gparent, parent, root, RB_RED);
+            augment_rotate(gparent, parent);
+            break;
+        }
+    }
+}
+
+/*
+ * Inline version for rb_erase() use - we want to be able to inline
+ * and eliminate the dummy_rotate callback there
+ */
+static inline void
+____rb_erase_color(struct RbNode *parent, struct RbRoot *root,
+                   void (*augment_rotate)(struct RbNode *old,
+                                          struct RbNode *new))
+{
+    struct RbNode *node = NULL, *sibling, *tmp1, *tmp2;
+
+    while (true) {
+        /*
+         * Loop invariants:
+         * - node is black (or NULL on first iteration)
+         * - node is not the root (parent is not NULL)
+         * - All leaf paths going through parent and node have a
+         *   black node count that is 1 lower than other leaf paths.
+         */
+        sibling = parent->rb_right;
+        if (node != sibling) {    /* node == parent->rb_left */
+            if (RB_IS_RED(sibling)) {
+                /*
+                 * Case 1 - left rotate at parent
+                 *
+                 *     P               S
+                 *    / \             / \
+                 *   N   s    -->    p   Sr
+                 *      / \         / \
+                 *     Sl  Sr      N   Sl
+                 */
+                parent->rb_right = tmp1 = sibling->rb_left;
+                sibling->rb_left = parent;
+                rb_set_parent_color(tmp1, parent, RB_BLACK);
+                __rb_rotate_set_parents(parent, sibling, root,
+                            RB_RED);
+                augment_rotate(parent, sibling);
+                sibling = tmp1;
+            }
+            tmp1 = sibling->rb_right;
+            if (!tmp1 || RB_IS_BLACK(tmp1)) {
+                tmp2 = sibling->rb_left;
+                if (!tmp2 || RB_IS_BLACK(tmp2)) {
+                    /*
+                     * Case 2 - sibling color flip
+                     * (p could be either color here)
+                     *
+                     *    (p)           (p)
+                     *    / \           / \
+                     *   N   S    -->  N   s
+                     *      / \           / \
+                     *     Sl  Sr        Sl  Sr
+                     *
+                     * This leaves us violating 5) which
+                     * can be fixed by flipping p to black
+                     * if it was red, or by recursing at p.
+                     * p is red when coming from Case 1.
+                     */
+                    rb_set_parent_color(sibling, parent,
+                                RB_RED);
+                    if (RB_IS_RED(parent)) {
+                        rb_set_black(parent);
+                    } else {
+                        node = parent;
+                        parent = RB_PARENT(node);
+                        if (parent) {
+                            continue;
+                        }
+                    }
+                    break;
+                }
+                /*
+                 * Case 3 - right rotate at sibling
+                 * (p could be either color here)
+                 *
+                 *   (p)           (p)
+                 *   / \           / \
+                 *  N   S    -->  N   Sl
+                 *     / \             \
+                 *    sl  Sr            s
+                 *                       \
+                 *                        Sr
+                 */
+                sibling->rb_left = tmp1 = tmp2->rb_right;
+                tmp2->rb_right = sibling;
+                parent->rb_right = tmp2;
+                if (tmp1) {
+                    rb_set_parent_color(tmp1, sibling, RB_BLACK);
+                }
+                augment_rotate(sibling, tmp2);
+                tmp1 = sibling;
+                sibling = tmp2;
+            }
+            /*
+             * Case 4 - left rotate at parent + color flips
+             * (p and sl could be either color here.
+             *  After rotation, p becomes black, s acquires
+             *  p's color, and sl keeps its color)
+             *
+             *      (p)             (s)
+             *      / \             / \
+             *     N   S     -->   P   Sr
+             *        / \         / \
+             *      (sl) sr      N  (sl)
+             */
+            parent->rb_right = tmp2 = sibling->rb_left;
+            sibling->rb_left = parent;
+            rb_set_parent_color(tmp1, sibling, RB_BLACK);
+            if (tmp2) {
+                rb_set_parent(tmp2, parent);
+            }
+            __rb_rotate_set_parents(parent, sibling, root,
+                        RB_BLACK);
+            augment_rotate(parent, sibling);
+            break;
+        } else {
+            sibling = parent->rb_left;
+            if (RB_IS_RED(sibling)) {
+                /* Case 1 - right rotate at parent */
+                parent->rb_left = tmp1 = sibling->rb_right;
+                sibling->rb_right = parent;
+                rb_set_parent_color(tmp1, parent, RB_BLACK);
+                __rb_rotate_set_parents(parent, sibling, root,
+                            RB_RED);
+                augment_rotate(parent, sibling);
+                sibling = tmp1;
+            }
+            tmp1 = sibling->rb_left;
+            if (!tmp1 || RB_IS_BLACK(tmp1)) {
+                tmp2 = sibling->rb_right;
+                if (!tmp2 || RB_IS_BLACK(tmp2)) {
+                    /* Case 2 - sibling color flip */
+                    rb_set_parent_color(sibling, parent,
+                                RB_RED);
+                    if (RB_IS_RED(parent)) {
+                        rb_set_black(parent);
+                    } else {
+                        node = parent;
+                        parent = RB_PARENT(node);
+                        if (parent) {
+                            continue;
+                        }
+                    }
+                    break;
+                }
+                /* Case 3 - right rotate at sibling */
+                sibling->rb_right = tmp1 = tmp2->rb_left;
+                tmp2->rb_left = sibling;
+                parent->rb_left = tmp2;
+                if (tmp1) {
+                    rb_set_parent_color(tmp1, sibling,
+                                RB_BLACK);
+                }
+                augment_rotate(sibling, tmp2);
+                tmp1 = sibling;
+                sibling = tmp2;
+            }
+            /* Case 4 - left rotate at parent + color flips */
+            parent->rb_left = tmp2 = sibling->rb_right;
+            sibling->rb_right = parent;
+            rb_set_parent_color(tmp1, sibling, RB_BLACK);
+            if (tmp2) {
+                rb_set_parent(tmp2, parent);
+            }
+            __rb_rotate_set_parents(parent, sibling, root,
+                        RB_BLACK);
+            augment_rotate(parent, sibling);
+            break;
+        }
+    }
+}
+
+/* Non-inline version for rb_erase_augmented() use */
+void __rb_erase_color(struct RbNode *parent, struct RbRoot *root,
+                      void (*augment_rotate)(struct RbNode *old,
+                                             struct RbNode *new))
+{
+    ____rb_erase_color(parent, root, augment_rotate);
+}
+
+/*
+ * Non-augmented rbtree manipulation functions.
+ *
+ * We use dummy augmented callbacks here, and have the compiler optimize them
+ * out of the rb_insert_color() and rb_erase() function definitions.
+ */
+
+static inline void dummy_propagate(struct RbNode *node, struct RbNode *stop) {}
+static inline void dummy_copy(struct RbNode *old, struct RbNode *new) {}
+static inline void dummy_rotate(struct RbNode *old, struct RbNode *new) {}
+
+static const struct RbAugmentCallbacks dummy_callbacks = {
+    dummy_propagate, dummy_copy, dummy_rotate
+};
+
+void rb_insert_color(struct RbNode *node, struct RbRoot *root)
+{
+    __rb_insert(node, root, dummy_rotate);
+}
+
+void rb_erase(struct RbNode *node, struct RbRoot *root)
+{
+    struct RbNode *rebalance;
+    rebalance = __rb_erase_augmented(node, root, &dummy_callbacks);
+    if (rebalance) {
+        ____rb_erase_color(rebalance, root, dummy_rotate);
+    }
+}
+
+/*
+ * Augmented rbtree manipulation functions.
+ *
+ * This instantiates the same __always_inline functions as in the non-augmented
+ * case, but this time with user-defined callbacks.
+ */
+
+void __rb_insert_augmented(struct RbNode *node, struct RbRoot *root,
+                           void (*augment_rotate)(struct RbNode *old,
+                                                  struct RbNode *new))
+{
+    __rb_insert(node, root, augment_rotate);
+}
+
+/*
+ * This function returns the first node (in sort order) of the tree.
+ */
+struct RbNode *rb_first(const struct RbRoot *root)
+{
+    struct RbNode    *n;
+
+    n = root->rb_node;
+    if (!n) {
+        return NULL;
+    }
+    while (n->rb_left) {
+        n = n->rb_left;
+    }
+    return n;
+}
+
+struct RbNode *rb_last(const struct RbRoot *root)
+{
+    struct RbNode    *n;
+
+    n = root->rb_node;
+    if (!n) {
+        return NULL;
+    }
+    while (n->rb_right) {
+        n = n->rb_right;
+    }
+    return n;
+}
+
+struct RbNode *rb_next(const struct RbNode *node)
+{
+    struct RbNode *parent;
+
+    if (RB_EMPTY_NODE(node)) {
+        return NULL;
+    }
+
+    /*
+     * If we have a right-hand child, go down and then left as far
+     * as we can.
+     */
+    if (node->rb_right) {
+        node = node->rb_right;
+        while (node->rb_left) {
+            node = node->rb_left;
+        }
+        return (struct RbNode *)node;
+    }
+
+    /*
+     * No right-hand children. Everything down and left is smaller than us,
+     * so any 'next' node must be in the general direction of our parent.
+     * Go up the tree; any time the ancestor is a right-hand child of its
+     * parent, keep going up. First time it's a left-hand child of its
+     * parent, said parent is our 'next' node.
+     */
+    while ((parent = RB_PARENT(node)) && node == parent->rb_right) {
+        node = parent;
+    }
+    return parent;
+}
+
+struct RbNode *rb_prev(const struct RbNode *node)
+{
+    struct RbNode *parent;
+
+    if (RB_EMPTY_NODE(node)) {
+        return NULL;
+    }
+
+    /*
+     * If we have a left-hand child, go down and then right as far
+     * as we can.
+     */
+    if (node->rb_left) {
+        node = node->rb_left;
+        while (node->rb_right) {
+            node = node->rb_right;
+        }
+        return (struct RbNode *)node;
+    }
+
+    /*
+     * No left-hand children. Go up till we find an ancestor which
+     * is a right-hand child of its parent.
+     */
+    while ((parent = RB_PARENT(node)) && node == parent->rb_left) {
+        node = parent;
+    }
+    return parent;
+}
+
+void rb_replace_node(struct RbNode *victim, struct RbNode *new,
+             struct RbRoot *root)
+{
+    struct RbNode *parent = RB_PARENT(victim);
+
+    /* Set the surrounding nodes to point to the replacement */
+    __rb_change_child(victim, new, parent, root);
+    if (victim->rb_left) {
+        rb_set_parent(victim->rb_left, new);
+    }
+    if (victim->rb_right) {
+        rb_set_parent(victim->rb_right, new);
+    }
+    /* Copy the pointers/colour from the victim to the replacement */
+    *new = *victim;
+}
+
+static struct RbNode *rb_left_deepest_node(const struct RbNode *node)
+{
+    for (;;) {
+        if (node->rb_left) {
+            node = node->rb_left;
+        } else if (node->rb_right) {
+            node = node->rb_right;
+        } else {
+            return (struct RbNode *)node;
+        }
+    }
+}
+
+struct RbNode *rb_next_postorder(const struct RbNode *node)
+{
+    const struct RbNode *parent;
+    if (!node) {
+        return NULL;
+    }
+    parent = RB_PARENT(node);
+
+    /* If we're sitting on node, we've already seen our children */
+    if (parent && node == parent->rb_left && parent->rb_right) {
+        /* If we are the parent's left node, go to the parent's right
+         * node then all the way down to the left */
+        return rb_left_deepest_node(parent->rb_right);
+    } else {
+        /* Otherwise we are the parent's right node, and the parent
+         * should be next */
+        return (struct RbNode *)parent;
+    }
+}
+
+struct RbNode *rb_first_postorder(const struct RbRoot *root)
+{
+    if (!root->rb_node) {
+        return NULL;
+    }
+    return rb_left_deepest_node(root->rb_node);
+}
-- 
2.11.0


* [Qemu-devel] [PATCH v2 03/18] util/rbcache: range-based cache core
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

RBCache provides functionality to cache data from block devices (primarily).
A byte range is used as the main key for searching and storing data. The
cache is based on red-black trees, so the basic operations search, insert
and delete are performed in O(log n).

It is important to note that QEMU usually does not require a data cache, but
in practice there are already cases where caching small amounts of data can
increase performance. Red-black trees were selected as the data structure
because they are quite simple and show high efficiency on a small number of
elements. Therefore, with a minimum range of 512 bytes, the recommended size
of the cache memory is no more than 8-16 MB. Also note that this cache
implementation allows storing ranges of different lengths without alignment.

The generic cache core can easily be used to implement different caching
policies at the block level, such as readahead. It can also be used in some
special cases, for example for caching data in qcow2 during sequential
allocating writes to an image with a backing file.
Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 MAINTAINERS            |   6 ++
 include/qemu/rbcache.h | 128 +++++++++++++++++++++++++
 util/Makefile.objs     |   1 +
 util/rbcache.c         | 253 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 388 insertions(+)
 create mode 100644 include/qemu/rbcache.h
 create mode 100644 util/rbcache.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 228278c1ca..01f4afa1e4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1472,6 +1472,12 @@ F: include/qemu/rbtree.h
 F: include/qemu/rbtree_augmented.h
 F: util/rbtree.c
 
+Range-Based Cache
+M: Denis V. Lunev <den@openvz.org>
+S: Supported
+F: include/qemu/rbcache.h
+F: util/rbcache.c
+
 UUID
 M: Fam Zheng <famz@redhat.com>
 S: Supported
diff --git a/include/qemu/rbcache.h b/include/qemu/rbcache.h
new file mode 100644
index 0000000000..24f7c1cb80
--- /dev/null
+++ b/include/qemu/rbcache.h
@@ -0,0 +1,128 @@
+/*
+ * QEMU Range-Based Cache core
+ *
+ * Copyright (C) 2015-2016 Parallels IP Holdings GmbH.
+ *
+ * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#ifndef RBCACHE_H
+#define RBCACHE_H
+
+#include "qemu/rbtree.h"
+#include "qemu/queue.h"
+
+typedef struct RBCacheNode {
+    struct RbNode rb_node;
+    uint64_t offset;
+    uint64_t bytes;
+    QTAILQ_ENTRY(RBCacheNode) entry;
+} RBCacheNode;
+
+typedef struct RBCache RBCache;
+
+/* These callbacks are used to extend the common structure RBCacheNode. The
+ * alloc callback should initialize only fields of the expanded node. Node
+ * common part is initialized in RBCache( see rbcache_node_alloc() ).
+ */
+typedef RBCacheNode *RBNodeAlloc(uint64_t offset, uint64_t bytes, void *opaque);
+typedef void RBNodeFree(RBCacheNode *node, void *opaque);
+
+
+enum eviction_type {
+    RBCACHE_FIFO,
+    RBCACHE_LRU,
+};
+
+/**
+ * rbcache_search:
+ * @rbcache: the cache object.
+ * @offset: the start of the range.
+ * @bytes: the size of the range.
+ *
+ * Returns the node corresponding to the range(offset, bytes), or NULL if
+ * the node was not found. In the case when the range covers multiple nodes,
+ * it returns the node with the lowest offset.
+ */
+void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes);
+
+/**
+ * rbcache_insert:
+ * @rbcache: the cache object.
+ * @node: a new node for the cache.
+ *
+ * Returns the new node, or old node if a node describing the same range
+ * already exists. In case of partial overlaps, the existing overlapping node
+ * with the lowest offset is returned.
+ */
+void *rbcache_insert(RBCache *rbcache, RBCacheNode *node);
+
+/**
+ * rbcache_search_and_insert:
+ * @rbcache: the cache object.
+ * @offset: the start of the range.
+ * @bytes: the size of the range.
+ *
+ * rbcache_search_and_insert() is like rbcache_insert(), except that a new
+ * node is allocated inside the function. Returns the new node, or the old
+ * node if one describing the same range already exists. In case of partial
+ * overlaps, the existing overlapping node with the lowest offset is returned.
+ */
+void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
+                                uint64_t bytes);
+
+/**
+ * rbcache_remove:
+ * @rbcache: the cache object.
+ * @node: a node to remove.
+ *
+ * Removes the cached range owned by the node, it also frees the node.
+ */
+void rbcache_remove(RBCache *rbcache, RBCacheNode *node);
+
+/**
+ * rbcache_node_alloc:
+ * @rbcache: the cache object.
+ * @offset: the start of the range.
+ * @bytes: the size of the range.
+ *
+ * Returns an allocated and initialized node.
+ */
+RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
+                                uint64_t bytes);
+
+/**
+ * rbcache_node_free:
+ * @rbcache: the cache object.
+ * @node: a node to free.
+ *
+ * Frees the node.
+ */
+void rbcache_node_free(RBCache *rbcache, RBCacheNode *node);
+
+/**
+ * rbcache_create:
+ * @alloc: callback to allocation node, allows to upgrade allocate and expand
+ *         the capabilities of the node.
+ * @free: callback to release node, must be used together with alloc callback.
+ * @limit_size: maximum cache size in bytes.
+ * @eviction_type: method of memory limitation
+ * @opaque: the opaque pointer to pass to the callback.
+ *
+ * Returns the cache object.
+ */
+RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
+                        uint64_t limit_size, int eviction_type, void *opaque);
+
+/**
+ * rbcache_destroy:
+ * @rbcache: the cache object.
+ *
+ * Cleanup the cache object created with rbcache_create().
+ */
+void rbcache_destroy(RBCache *rbcache);
+
+#endif /* RBCACHE_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index a5607cb88f..e9f545ddbf 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -37,3 +37,4 @@ util-obj-y += qdist.o
 util-obj-y += qht.o
 util-obj-y += range.o
 util-obj-y += rbtree.o
+util-obj-y += rbcache.o
diff --git a/util/rbcache.c b/util/rbcache.c
new file mode 100644
index 0000000000..2f1f860f76
--- /dev/null
+++ b/util/rbcache.c
@@ -0,0 +1,253 @@
+/*
+ * QEMU Range-Based Cache core
+ *
+ * Copyright (C) 2015-2016 Parallels IP Holdings GmbH.
+ *
+ * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/rbcache.h"
+
+/* RBCache provides functionality to cache data from block devices. The range
+ * (offset, bytes) is used as the key for searching and storing data. The
+ * cache is based on red-black trees, so the basic operations (search, insert,
+ * delete) are performed in O(log n).
+ *
+ * It is worth noting that QEMU usually does not require a data cache, but in
+ * practice there are already cases where caching small amounts of data can
+ * increase performance. A red-black tree was chosen as the data structure
+ * because it is simple and shows high efficiency for a small number of
+ * elements. Therefore, with a minimum range of 512 bytes, the recommended
+ * cache size is no more than 8-16 MB. Also note that this cache
+ * implementation can store ranges of different lengths without alignment.
+ */
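+
+/* A minimal usage sketch (illustrative only; with NULL callbacks the nodes
+ * are allocated with the default g_new()):
+ *
+ *     RBCache *cache = rbcache_create(NULL, NULL, 16 << 20, RBCACHE_LRU, NULL);
+ *     RBCacheNode *node = rbcache_search_and_insert(cache, 0, 512);
+ *     ...
+ *     rbcache_remove(cache, node);
+ *     rbcache_destroy(cache);
+ */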
+
+struct RBCache {
+    struct RbRoot root;
+    RBNodeAlloc *alloc;
+    RBNodeFree  *free;
+    uint64_t limit_size;
+    uint64_t cur_size;
+    enum eviction_type eviction_type;
+    void *opaque;
+
+    QTAILQ_HEAD(RBCacheNodeHead, RBCacheNode) queue;
+};
+
+static int node_key_cmp(const RBCacheNode *node1, const RBCacheNode *node2)
+{
+    assert(node1 != NULL);
+    assert(node2 != NULL);
+
+    if (node1->offset >= node2->offset + node2->bytes) {
+        return 1;
+    }
+    if (node1->offset + node1->bytes <= node2->offset) {
+        return -1;
+    }
+
+    return 0;
+}
+
+/* Find the leftmost node whose range overlaps the given target_offset. */
+static RBCacheNode *node_previous(RBCacheNode *node, uint64_t target_offset)
+{
+    while (node) {
+        struct RbNode *prev_rb_node = rb_prev(&node->rb_node);
+        RBCacheNode *prev_node;
+        if (prev_rb_node == NULL) {
+            break;
+        }
+        prev_node = container_of(prev_rb_node, RBCacheNode, rb_node);
+        if (prev_node->offset + prev_node->bytes <= target_offset) {
+            break;
+        }
+        node = prev_node;
+    }
+
+    assert(node != NULL);
+
+    return node;
+}
+
+RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
+                                uint64_t bytes)
+{
+    RBCacheNode *node;
+
+    if (rbcache->alloc) {
+        node = rbcache->alloc(offset, bytes, rbcache->opaque);
+    } else {
+        node = g_new(RBCacheNode, 1);
+    }
+
+    node->offset = offset;
+    node->bytes  = bytes;
+
+    return node;
+}
+
+void rbcache_node_free(RBCache *rbcache, RBCacheNode *node)
+{
+    if (rbcache->free) {
+        rbcache->free(node, rbcache->opaque);
+    } else {
+        g_free(node);
+    }
+}
+
+static void rbcache_try_shrink(RBCache *rbcache)
+{
+    while (rbcache->cur_size > rbcache->limit_size) {
+        RBCacheNode *node;
+        assert(!QTAILQ_EMPTY(&rbcache->queue));
+
+        node = QTAILQ_LAST(&rbcache->queue, RBCacheNodeHead);
+
+        rbcache_remove(rbcache, node);
+    }
+}
+
+static inline void node_move_in_queue(RBCache *rbcache, RBCacheNode *node)
+{
+    if (rbcache->eviction_type == RBCACHE_LRU) {
+        QTAILQ_REMOVE(&rbcache->queue, node, entry);
+        QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
+    }
+}
+
+/*
+ * Adds a new node to the tree if its range does not overlap any existing
+ * node, and returns the new node. If the new node overlaps an existing
+ * node, the tree is left unchanged and a pointer to the existing node is
+ * returned. If the new node covers multiple nodes, the leftmost overlapping
+ * node is returned.
+ */
+static RBCacheNode *node_insert(RBCache *rbcache, RBCacheNode *node, bool alloc)
+{
+    struct RbNode **new, *parent = NULL;
+
+    assert(rbcache != NULL);
+    assert(node->bytes != 0);
+
+    /* Figure out where to put new node */
+    new = &(rbcache->root.rb_node);
+    while (*new) {
+        RBCacheNode *this = container_of(*new, RBCacheNode, rb_node);
+        int result = node_key_cmp(node, this);
+        if (result == 0) {
+            this = node_previous(this, node->offset);
+            node_move_in_queue(rbcache, this);
+            return this;
+        }
+        parent = *new;
+        new = result < 0 ? &((*new)->rb_left) : &((*new)->rb_right);
+    }
+
+    if (alloc) {
+        node = rbcache_node_alloc(rbcache, node->offset, node->bytes);
+    }
+    /* Add new node and rebalance tree. */
+    rb_link_node(&node->rb_node, parent, new);
+    rb_insert_color(&node->rb_node, &rbcache->root);
+
+    rbcache->cur_size += node->bytes;
+
+    rbcache_try_shrink(rbcache);
+
+    QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
+
+    return node;
+}
+
+void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes)
+{
+    struct RbNode *rb_node;
+    RBCacheNode node = {
+        .offset = offset,
+        .bytes  = bytes,
+    };
+
+    assert(rbcache != NULL);
+
+    rb_node = rbcache->root.rb_node;
+    while (rb_node) {
+        RBCacheNode *this = container_of(rb_node, RBCacheNode, rb_node);
+        int32_t result = node_key_cmp(&node, this);
+        if (result == 0) {
+            this = node_previous(this, offset);
+            node_move_in_queue(rbcache, this);
+            return this;
+        }
+        rb_node = result < 0 ? rb_node->rb_left : rb_node->rb_right;
+    }
+    return NULL;
+}
+
+void *rbcache_insert(RBCache *rbcache, RBCacheNode *node)
+{
+    return node_insert(rbcache, node, false);
+}
+
+void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
+                                uint64_t bytes)
+{
+    RBCacheNode node = {
+        .offset = offset,
+        .bytes  = bytes,
+    };
+
+    return node_insert(rbcache, &node, true);
+}
+
+void rbcache_remove(RBCache *rbcache, RBCacheNode *node)
+{
+    assert(rbcache->cur_size >= node->bytes);
+
+    rbcache->cur_size -= node->bytes;
+    rb_erase(&node->rb_node, &rbcache->root);
+
+    QTAILQ_REMOVE(&rbcache->queue, node, entry);
+
+    rbcache_node_free(rbcache, node);
+}
+
+RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
+                        uint64_t limit_size, int eviction_type, void *opaque)
+{
+    RBCache *rbcache = g_new(RBCache, 1);
+
+    /* Either both callbacks must be provided, or neither of them */
+    assert(!(!alloc ^ !free));
+
+    *rbcache = (RBCache) {
+        .root          = RB_ROOT,
+        .alloc         = alloc,
+        .free          = free,
+        .limit_size    = limit_size,
+        .eviction_type = eviction_type,
+        .opaque        = opaque,
+        .queue         = QTAILQ_HEAD_INITIALIZER(rbcache->queue),
+    };
+
+    return rbcache;
+}
+
+void rbcache_destroy(RBCache *rbcache)
+{
+    RBCacheNode *node, *next;
+
+    assert(rbcache != NULL);
+
+    QTAILQ_FOREACH_SAFE(node, &rbcache->queue, entry, next) {
+        QTAILQ_REMOVE(&rbcache->queue, node, entry);
+        rbcache_node_free(rbcache, node);
+    }
+
+    g_free(rbcache);
+}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 04/18] tests/test-rbcache: add test cases
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (2 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 03/18] util/rbcache: range-based cache core Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 05/18] block/pcache: statistics collection read requests Pavel Butsykin
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 tests/Makefile.include |   3 +
 tests/test-rbcache.c   | 431 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 434 insertions(+)
 create mode 100644 tests/test-rbcache.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 4841d582a1..74bdcd2aac 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -55,6 +55,8 @@ check-unit-y += tests/test-hbitmap$(EXESUF)
 gcov-files-test-hbitmap-y = blockjob.c
 check-unit-y += tests/test-blockjob$(EXESUF)
 check-unit-y += tests/test-blockjob-txn$(EXESUF)
+gcov-files-test-rbcache-y = util/rbcache.c
+check-unit-y += tests/test-rbcache$(EXESUF)
 check-unit-y += tests/test-x86-cpuid$(EXESUF)
 # all code tested by test-x86-cpuid is inside topology.h
 gcov-files-test-x86-cpuid-y =
@@ -498,6 +500,7 @@ tests/test-blockjob-txn$(EXESUF): tests/test-blockjob-txn.o $(test-block-obj-y)
 tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
 tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
 tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y)
+tests/test-rbcache$(EXESUF): tests/test-rbcache.o $(test-util-obj-y)
 tests/test-x86-cpuid$(EXESUF): tests/test-x86-cpuid.o
 tests/test-xbzrle$(EXESUF): tests/test-xbzrle.o migration/xbzrle.o page_cache.o $(test-util-obj-y)
 tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o
diff --git a/tests/test-rbcache.c b/tests/test-rbcache.c
new file mode 100644
index 0000000000..40ccd6b225
--- /dev/null
+++ b/tests/test-rbcache.c
@@ -0,0 +1,431 @@
+/*
+ * QEMU Range-Based Cache core
+ *
+ * Copyright (C) 2015-2016 Parallels IP Holdings GmbH.
+ *
+ * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/rbcache.h"
+
+typedef struct TestRBCacheData {
+    RBCache *cache;
+} TestRBCacheData;
+
+typedef struct TestRBCacheConfig {
+    uint64_t limit_size;
+    int eviction_type;
+    RBNodeAlloc *alloc;
+    RBNodeFree  *free;
+    void *opaque;
+} TestRBCacheConfig;
+
+#define KB(_n) ((_n) << 10)
+#define MB(_n) ((_n) << 20)
+
+#define OFFSET1 0
+#define SIZE1   KB(1)
+
+#define OFFSET2 KB(1)
+#define SIZE2   KB(2)
+
+#define OFFSET3 KB(18)
+#define SIZE3   KB(1)
+
+#define OFFSET4 KB(7)
+#define SIZE4   KB(7)
+
+#define OFFSET5 KB(1)
+#define SIZE5   KB(4)
+
+#define OFFSET6 KB(5)
+#define SIZE6   KB(5)
+
+#define OFFSET7 KB(15)
+#define SIZE7   KB(20)
+
+#define OFFSET8 KB(2)
+#define SIZE8   KB(20)
+
+
+static void test_rbcache_init(TestRBCacheData *data, const void *ctx)
+{
+    g_assert_nonnull(data->cache);
+}
+
+static void test_rbcache_insert(TestRBCacheData *data, const void *ctx)
+{
+    RBCacheNode *node1 = rbcache_node_alloc(data->cache, OFFSET1, SIZE1);
+    RBCacheNode *node2 = rbcache_node_alloc(data->cache, OFFSET2, SIZE2);
+    RBCacheNode *node3 = rbcache_node_alloc(data->cache, OFFSET3, SIZE3);
+    RBCacheNode *node4 = rbcache_node_alloc(data->cache, OFFSET4, SIZE4);
+    RBCacheNode *node5 = rbcache_node_alloc(data->cache, OFFSET5, SIZE5);
+    RBCacheNode *node6 = rbcache_node_alloc(data->cache, OFFSET6, SIZE6);
+    RBCacheNode *node7 = rbcache_node_alloc(data->cache, OFFSET7, SIZE7);
+    RBCacheNode *node8 = rbcache_node_alloc(data->cache, OFFSET8, SIZE8);
+    RBCacheNode *node;
+
+    node = rbcache_insert(data->cache, node2);
+    g_assert_true(node == node2);
+
+    node = rbcache_insert(data->cache, node1);
+    g_assert_true(node == node1);
+
+    node = rbcache_insert(data->cache, node3);
+    g_assert_true(node == node3);
+
+    node = rbcache_insert(data->cache, node4);
+    g_assert_true(node == node4);
+
+    node = rbcache_insert(data->cache, node5);
+    g_assert_true(node == node2);
+    rbcache_node_free(data->cache, node5);
+
+    node = rbcache_insert(data->cache, node6);
+    g_assert_true(node == node4);
+    rbcache_node_free(data->cache, node6);
+
+    node = rbcache_insert(data->cache, node7);
+    g_assert_true(node == node3);
+    rbcache_node_free(data->cache, node7);
+
+    node = rbcache_insert(data->cache, node8);
+    g_assert_true(node == node2);
+    rbcache_node_free(data->cache, node8);
+}
+
+static void test_rbcache_search(TestRBCacheData *data, const void *ctx)
+{
+    RBCacheNode *node;
+
+    test_rbcache_insert(data, ctx);
+
+    node = rbcache_search(data->cache, OFFSET1, SIZE1);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET1);
+    g_assert_cmpuint(node->bytes, ==, SIZE1);
+
+    node = rbcache_search(data->cache, OFFSET2 + KB(1), SIZE2);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET2);
+    g_assert_cmpuint(node->bytes, ==, SIZE2);
+
+    node = rbcache_search(data->cache, OFFSET8, SIZE8);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET2);
+    g_assert_cmpuint(node->bytes, ==, SIZE2);
+
+    node = rbcache_search(data->cache, OFFSET8 + KB(2), SIZE5);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET4);
+    g_assert_cmpuint(node->bytes, ==, SIZE4);
+
+    node = rbcache_search(data->cache, OFFSET3 + SIZE3, SIZE3);
+    g_assert_null(node);
+}
+
+static void test_rbcache_search_and_insert(TestRBCacheData *data,
+                                           const void *ctx)
+{
+    RBCacheNode *node;
+
+    node = rbcache_search_and_insert(data->cache, OFFSET2, SIZE2);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET2);
+    g_assert_cmpuint(node->bytes, ==, SIZE2);
+
+    node = rbcache_search_and_insert(data->cache, OFFSET1, SIZE1);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET1);
+    g_assert_cmpuint(node->bytes, ==, SIZE1);
+
+    node = rbcache_search_and_insert(data->cache, OFFSET3, SIZE3);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET3);
+    g_assert_cmpuint(node->bytes, ==, SIZE3);
+
+    node = rbcache_search_and_insert(data->cache, OFFSET4, SIZE4);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET4);
+    g_assert_cmpuint(node->bytes, ==, SIZE4);
+
+    node = rbcache_search_and_insert(data->cache, OFFSET5, SIZE5);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET2);
+    g_assert_cmpuint(node->bytes, ==, SIZE2);
+
+    node = rbcache_search_and_insert(data->cache, OFFSET6, SIZE6);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET4);
+    g_assert_cmpuint(node->bytes, ==, SIZE4);
+
+    node = rbcache_search_and_insert(data->cache, OFFSET7, SIZE7);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET3);
+    g_assert_cmpuint(node->bytes, ==, SIZE3);
+
+    node = rbcache_search_and_insert(data->cache, OFFSET8, SIZE8);
+    g_assert_nonnull(node);
+    g_assert_cmpuint(node->offset, ==, OFFSET2);
+    g_assert_cmpuint(node->bytes, ==, SIZE2);
+}
+
+static void test_rbcache_remove(TestRBCacheData *data, const void *ctx)
+{
+    RBCacheNode *node;
+
+    test_rbcache_search_and_insert(data, ctx);
+
+    node = rbcache_search(data->cache, OFFSET1, SIZE1);
+    g_assert_nonnull(node);
+    rbcache_remove(data->cache, node);
+    node = rbcache_search(data->cache, OFFSET1, SIZE1);
+    g_assert_null(node);
+
+    node = rbcache_search(data->cache, OFFSET3, SIZE3);
+    g_assert_nonnull(node);
+    rbcache_remove(data->cache, node);
+    node = rbcache_search(data->cache, OFFSET3, SIZE3);
+    g_assert_null(node);
+
+    node = rbcache_search(data->cache, OFFSET4, SIZE4);
+    g_assert_nonnull(node);
+    rbcache_remove(data->cache, node);
+    node = rbcache_search(data->cache, OFFSET4, SIZE4);
+    g_assert_null(node);
+
+    node = rbcache_search(data->cache, OFFSET2, SIZE2);
+    g_assert_nonnull(node);
+    rbcache_remove(data->cache, node);
+    node = rbcache_search(data->cache, OFFSET2, SIZE2);
+    g_assert_null(node);
+}
+
+static void test_rbcache_shrink(TestRBCacheData *data, const void *ctx)
+{
+    RBCacheNode *node;
+
+    node = rbcache_search_and_insert(data->cache, 0, MB(2));
+    g_assert_nonnull(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(2), MB(3));
+    g_assert_nonnull(node);
+
+    node = rbcache_search(data->cache, 0, MB(2));
+    g_assert_null(node);
+
+    node = rbcache_search(data->cache, MB(2), MB(3));
+    g_assert_nonnull(node);
+
+    node = rbcache_search_and_insert(data->cache, 0, MB(2));
+    g_assert_nonnull(node);
+
+    node = rbcache_search(data->cache, 0, MB(2));
+    g_assert_nonnull(node);
+
+    node = rbcache_search(data->cache, MB(2), MB(3));
+    g_assert_null(node);
+}
+
+static void test_rbcache_shrink_fifo(TestRBCacheData *data, const void *ctx)
+{
+    RBCacheNode *node;
+
+    rbcache_search_and_insert(data->cache, 0, MB(1));
+    rbcache_search_and_insert(data->cache, MB(1), MB(1));
+    rbcache_search_and_insert(data->cache, MB(2), MB(1));
+    rbcache_search_and_insert(data->cache, MB(3), MB(1));
+
+    node = rbcache_search_and_insert(data->cache, MB(4), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, 0, MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(3), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(1), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(2), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(4), MB(1));
+    g_assert_nonnull(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(5), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, 0, MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(1), MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(3), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(2), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(4), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(5), MB(1));
+    g_assert_nonnull(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(6), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, 0, MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(1), MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(2), MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(3), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(4), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(5), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(6), MB(1));
+    g_assert_nonnull(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(7), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, 0, MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(1), MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(2), MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(3), MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(4), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(5), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(6), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(7), MB(1));
+    g_assert_nonnull(node);
+}
+
+static void test_rbcache_shrink_lru(TestRBCacheData *data, const void *ctx)
+{
+    RBCacheNode *node;
+
+    rbcache_search_and_insert(data->cache, 0, MB(1));
+    rbcache_search_and_insert(data->cache, MB(1), MB(1));
+    rbcache_search_and_insert(data->cache, MB(2), MB(1));
+    rbcache_search_and_insert(data->cache, MB(3), MB(1));
+
+    node = rbcache_search_and_insert(data->cache, MB(4), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, 0, MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(3), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(1), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(2), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(4), MB(1));
+    g_assert_nonnull(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(5), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, 0, MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(3), MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(1), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(2), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(4), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(5), MB(1));
+    g_assert_nonnull(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(6), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, 0, MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(3), MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(1), MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(2), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(4), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(5), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(6), MB(1));
+    g_assert_nonnull(node);
+
+    node = rbcache_search_and_insert(data->cache, MB(7), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, 0, MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(3), MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(1), MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(2), MB(1));
+    g_assert_null(node);
+    node = rbcache_search(data->cache, MB(4), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(5), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(6), MB(1));
+    g_assert_nonnull(node);
+    node = rbcache_search(data->cache, MB(7), MB(1));
+    g_assert_nonnull(node);
+}
+
+static void test_rbcache_setup(TestRBCacheData *data, const void *ctx)
+{
+    const TestRBCacheConfig *config = ctx;
+
+    data->cache =
+        rbcache_create(config->alloc, config->free, config->limit_size,
+                       config->eviction_type, config->opaque);
+}
+
+static void test_rbcache_teardown(TestRBCacheData *data, const void *ctx)
+{
+    rbcache_destroy(data->cache);
+}
+
+static void rbcache_test_add(const char *testpath,
+                             void (*test_func)(TestRBCacheData *data,
+                             const void *user_data), void *ctx)
+{
+    g_test_add(testpath, TestRBCacheData, ctx, test_rbcache_setup, test_func,
+               test_rbcache_teardown);
+}
+
+int main(int argc, char **argv)
+{
+    TestRBCacheConfig config = {
+        .limit_size = MB(4),
+        .eviction_type = RBCACHE_FIFO,
+    };
+    TestRBCacheConfig config_lru = {
+        .limit_size = MB(4),
+        .eviction_type = RBCACHE_LRU,
+    };
+
+    g_test_init(&argc, &argv, NULL);
+
+    rbcache_test_add("/rbcache/init", test_rbcache_init, &config);
+    rbcache_test_add("/rbcache/insert", test_rbcache_insert, &config);
+    rbcache_test_add("/rbcache/search", test_rbcache_search, &config);
+    rbcache_test_add("/rbcache/search_and_insert",
+                     test_rbcache_search_and_insert, &config);
+    rbcache_test_add("/rbcache/rbcache_remove", test_rbcache_remove, &config);
+    rbcache_test_add("/rbcache/shrink", test_rbcache_shrink, &config);
+    rbcache_test_add("/rbcache/shrink/fifo", test_rbcache_shrink_fifo, &config);
+    rbcache_test_add("/rbcache/shrink/lru", test_rbcache_shrink_lru,
+                     &config_lru);
+
+    g_test_run();
+
+    return 0;
+}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 05/18] block/pcache: statistics collection read requests
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (3 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 04/18] tests/test-rbcache: add test cases Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 06/18] block/pcache: skip large aio read Pavel Butsykin
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

Here the rbcache is used as a repository of request statistics; the request
data itself is not cached, so we can afford to track a large number of
requests. The statistics are needed to detect sequential request streams, as
sketched below.
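
A minimal sketch of how the statistics pool is driven (illustrative only;
3 MB is the default limit):

    RBCache *stats = rbcache_create(NULL, NULL, 3 << 20, RBCACHE_FIFO, NULL);
    /* on each read request only the range is recorded, not the data: */
    rbcache_search_and_insert(stats, offset, bytes);
    /* once the total volume of recorded ranges exceeds the limit, the
     * oldest ranges are evicted in FIFO order */
    rbcache_destroy(stats);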

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/block/pcache.c b/block/pcache.c
index 7a67618408..296ae382b0 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -13,20 +13,38 @@
 #include "block/block_int.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qstring.h"
+#include "qemu/rbcache.h"
 
+#define PCACHE_OPT_STATS_SIZE "pcache-stats-size"
 
 static QemuOptsList runtime_opts = {
     .name = "pcache",
     .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
     .desc = {
+        {
+            .name = PCACHE_OPT_STATS_SIZE,
+            .type = QEMU_OPT_SIZE,
+            .help = "Total volume of requests for statistics",
+        },
         { /* end of list */ }
     },
 };
 
+#define MB_BITS 20
+#define PCACHE_DEFAULT_STATS_SIZE (3 << MB_BITS)
+
+typedef struct BDRVPCacheState {
+    RBCache *req_stats;
+} BDRVPCacheState;
+
 static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
 {
+    BDRVPCacheState *s = bs->opaque;
+
+    rbcache_search_and_insert(s->req_stats, offset, bytes);
+
     return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
 }
 
@@ -37,6 +55,13 @@ static coroutine_fn int pcache_co_pwritev(BlockDriverState *bs, uint64_t offset,
     return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
 }
 
+static void pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
+{
+    uint64_t stats_size = qemu_opt_get_size(opts, PCACHE_OPT_STATS_SIZE,
+                                            PCACHE_DEFAULT_STATS_SIZE);
+    s->req_stats = rbcache_create(NULL, NULL, stats_size, RBCACHE_FIFO, s);
+}
+
 static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
                             Error **errp)
 {
@@ -58,7 +83,9 @@ static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
     if (local_err) {
         ret = -EINVAL;
         error_propagate(errp, local_err);
+        goto fail;
     }
+    pcache_state_init(opts, bs->opaque);
 fail:
     qemu_opts_del(opts);
     return ret;
@@ -66,6 +93,9 @@ fail:
 
 static void pcache_close(BlockDriverState *bs)
 {
+    BDRVPCacheState *s = bs->opaque;
+
+    rbcache_destroy(s->req_stats);
 }
 
 static int64_t pcache_getlength(BlockDriverState *bs)
@@ -81,7 +111,7 @@ static bool pcache_recurse_is_first_non_filter(BlockDriverState *bs,
 
 static BlockDriver bdrv_pcache = {
     .format_name                        = "pcache",
-    .instance_size                      = 0,
+    .instance_size                      = sizeof(BDRVPCacheState),
 
     .bdrv_file_open                     = pcache_file_open,
     .bdrv_close                         = pcache_close,
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 06/18] block/pcache: skip large aio read
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (4 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 05/18] block/pcache: statistics collection read requests Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 07/18] block/pcache: updating statistics for overlapping requests Pavel Butsykin
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

This change allows more efficient use of cache memory and filters out the
case for which pcache is not effective. We skip requests that gain nothing
from the optimization and thereby reduce the number of unnecessary
readaheads. For example, with the default 64 KiB threshold, a 1 MiB guest
read bypasses the statistics collection entirely.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/block/pcache.c b/block/pcache.c
index 296ae382b0..1f3200af63 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -16,6 +16,7 @@
 #include "qemu/rbcache.h"
 
 #define PCACHE_OPT_STATS_SIZE "pcache-stats-size"
+#define PCACHE_OPT_MAX_AIO_SIZE "pcache-max-aio-size"
 
 static QemuOptsList runtime_opts = {
     .name = "pcache",
@@ -26,15 +27,23 @@ static QemuOptsList runtime_opts = {
             .type = QEMU_OPT_SIZE,
             .help = "Total volume of requests for statistics",
         },
+        {
+            .name = PCACHE_OPT_MAX_AIO_SIZE,
+            .type = QEMU_OPT_SIZE,
+            .help = "Maximum size of aio which is handled by pcache",
+        },
         { /* end of list */ }
     },
 };
 
+#define KB_BITS 10
 #define MB_BITS 20
 #define PCACHE_DEFAULT_STATS_SIZE (3 << MB_BITS)
+#define PCACHE_DEFAULT_MAX_AIO_SIZE (64 << KB_BITS)
 
 typedef struct BDRVPCacheState {
     RBCache *req_stats;
+    uint64_t max_aio_size;
 } BDRVPCacheState;
 
 static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
@@ -43,7 +52,9 @@ static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
 {
     BDRVPCacheState *s = bs->opaque;
 
-    rbcache_search_and_insert(s->req_stats, offset, bytes);
+    if (s->max_aio_size >= bytes) {
+        rbcache_search_and_insert(s->req_stats, offset, bytes);
+    }
 
     return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
 }
@@ -60,6 +71,9 @@ static void pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
     uint64_t stats_size = qemu_opt_get_size(opts, PCACHE_OPT_STATS_SIZE,
                                             PCACHE_DEFAULT_STATS_SIZE);
     s->req_stats = rbcache_create(NULL, NULL, stats_size, RBCACHE_FIFO, s);
+
+    s->max_aio_size = qemu_opt_get_size(opts, PCACHE_OPT_MAX_AIO_SIZE,
+                                        PCACHE_DEFAULT_MAX_AIO_SIZE);
 }
 
 static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 07/18] block/pcache: updating statistics for overlapping requests
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (5 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 06/18] block/pcache: skip large aio read Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 08/18] block/pcache: add AIO readahead Pavel Butsykin
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

When the statistics are updated, I/O requests can sometimes overlap each
other; in that case the new request is not stored in the statistics at all.
That is unfortunate, especially when the requests intersect only over a
small range.

Instead, we can cut the request at the intersection and add the
non-overlapping pieces to the statistics. The gain may not be significant,
but it makes the statistics more accurate. Here, update_req_stats() adds the
ability to record overlapping requests, as illustrated below.
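
To illustrate the splitting (a sketch with made-up offsets):

    /* the statistics already contain the node [4096, 8192) */
    update_req_stats(s->req_stats, 0, 6144);
    /* the request sticks out on the left, so only the missing piece
     * [0, 4096) is inserted; the rest is already recorded */
    update_req_stats(s->req_stats, 6144, 6144);
    /* the request sticks out on the right, so the loop steps over the
     * existing node and inserts the tail [8192, 12288) */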

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/block/pcache.c b/block/pcache.c
index 1f3200af63..deac57c58d 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -46,6 +46,40 @@ typedef struct BDRVPCacheState {
     uint64_t max_aio_size;
 } BDRVPCacheState;
 
+static void update_req_stats(RBCache *rbcache, uint64_t offset, uint64_t bytes)
+{
+    do {
+        RBCacheNode *node = rbcache_search_and_insert(rbcache, offset, bytes);
+        /* The node was successfully added or already exists */
+        if (node->offset <= offset &&
+            node->offset + node->bytes >= offset + bytes)
+        {
+            break;
+        }
+
+        /* Request covers the whole node */
+        if (offset <= node->offset &&
+            offset + bytes >= node->offset + node->bytes)
+        {
+            rbcache_remove(rbcache, node);
+            continue;
+        }
+
+        if (offset < node->offset) {
+            RBCacheNode *new_node =
+                rbcache_node_alloc(rbcache, offset, node->offset - offset);
+            if (new_node != rbcache_insert(rbcache, new_node)) {
+                /* The presence of the node in this range is impossible */
+                abort();
+            }
+            break;
+        }
+
+        bytes = (offset + bytes) - (node->offset + node->bytes);
+        offset = node->offset + node->bytes;
+    } while (true);
+}
+
 static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -53,7 +87,7 @@ static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
     BDRVPCacheState *s = bs->opaque;
 
     if (s->max_aio_size >= bytes) {
-        rbcache_search_and_insert(s->req_stats, offset, bytes);
+        update_req_stats(s->req_stats, offset, bytes);
     }
 
     return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 08/18] block/pcache: add AIO readahead
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (6 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 07/18] block/pcache: updating statistics for overlapping requests Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 09/18] block/pcache: skip readahead for unallocated clusters Pavel Butsykin
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

This patch adds readahead of data into the cache. The readahead is issued as
a separate asynchronous request that does not depend on the completion of
the filtered read requests. Readahead happens only on one condition: a
sufficiently large region immediately preceding the current request must
already have been read sequentially, which the request statistics can tell
us. Of course, this method of detection is not entirely reliable, but in
most cases it works.
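
The trigger condition can be pictured as follows (a simplified sketch with
hypothetical numbers):

    /* readahead_size == 128K; a read at offset 1M triggers readahead only
     * if the statistics fully cover the window [1M - 128K, 1M): the check
     * walks the window left to right and fails on the first gap */
    if (check_request_sequence(s, offset)) {
        /* prefetch the next s->readahead_size bytes after the request */
    }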

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 206 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 204 insertions(+), 2 deletions(-)

diff --git a/block/pcache.c b/block/pcache.c
index deac57c58d..57eebd434a 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -17,6 +17,8 @@
 
 #define PCACHE_OPT_STATS_SIZE "pcache-stats-size"
 #define PCACHE_OPT_MAX_AIO_SIZE "pcache-max-aio-size"
+#define PCACHE_OPT_CACHE_SIZE "pcache-full-size"
+#define PCACHE_OPT_READAHEAD_SIZE "pcache-readahead-size"
 
 static QemuOptsList runtime_opts = {
     .name = "pcache",
@@ -32,6 +34,16 @@ static QemuOptsList runtime_opts = {
             .type = QEMU_OPT_SIZE,
             .help = "Maximum size of aio which is handled by pcache",
         },
+        {
+            .name = PCACHE_OPT_CACHE_SIZE,
+            .type = QEMU_OPT_SIZE,
+            .help = "Total cache size",
+        },
+        {
+            .name = PCACHE_OPT_READAHEAD_SIZE,
+            .type = QEMU_OPT_SIZE,
+            .help = "Prefetch cache readahead size",
+        },
         { /* end of list */ }
     },
 };
@@ -40,12 +52,46 @@ static QemuOptsList runtime_opts = {
 #define MB_BITS 20
 #define PCACHE_DEFAULT_STATS_SIZE (3 << MB_BITS)
 #define PCACHE_DEFAULT_MAX_AIO_SIZE (64 << KB_BITS)
+#define PCACHE_DEFAULT_CACHE_SIZE (4 << MB_BITS)
+#define PCACHE_DEFAULT_READAHEAD_SIZE (128 << KB_BITS)
 
 typedef struct BDRVPCacheState {
     RBCache *req_stats;
+    RBCache *cache;
     uint64_t max_aio_size;
+    uint64_t readahead_size;
 } BDRVPCacheState;
 
+typedef struct PCacheNode {
+    RBCacheNode common;
+    uint8_t *data;
+    enum {
+        NODE_STATUS_NEW       = 0x01,
+        NODE_STATUS_INFLIGHT  = 0x02,
+        NODE_STATUS_COMPLETED = 0x04,
+        NODE_STATUS_REMOVE    = 0x08,
+        NODE_STATUS_DELETED   = 0x10, /* only for debugging */
+    } status;
+    int ref;
+} PCacheNode;
+
+static inline void pcache_node_ref(PCacheNode *node)
+{
+    node->ref++;
+}
+
+static void pcache_node_unref(PCacheNode *node)
+{
+    assert(node->ref > 0);
+    if (--node->ref == 0) {
+        assert(node->status & NODE_STATUS_REMOVE);
+        node->status |= NODE_STATUS_DELETED;
+
+        g_free(node->data);
+        g_free(node);
+    }
+}
+
 static void update_req_stats(RBCache *rbcache, uint64_t offset, uint64_t bytes)
 {
     do {
@@ -80,16 +126,165 @@ static void update_req_stats(RBCache *rbcache, uint64_t offset, uint64_t bytes)
     } while (true);
 }
 
+static bool check_request_sequence(BDRVPCacheState *s, uint64_t offset)
+{
+    uint64_t cache_line_size = s->readahead_size;
+    uint64_t check_offset;
+
+    if (offset <= cache_line_size) {
+        return false;
+    }
+    check_offset = offset - cache_line_size;
+
+    do {
+        RBCacheNode *node = rbcache_search(s->req_stats, check_offset,
+                                           offset - check_offset);
+        if (node == NULL) {
+            return false;
+        }
+        if (node->offset > check_offset) {
+            return false;
+        }
+        check_offset = node->offset + node->bytes;
+    } while (check_offset < offset);
+
+    return true;
+}
+
+static void pcache_node_free(RBCacheNode *rbnode, void *opaque)
+{
+    PCacheNode *node = container_of(rbnode, PCacheNode, common);
+
+    assert(node->status == NODE_STATUS_INFLIGHT ||
+           node->status == NODE_STATUS_COMPLETED);
+
+    node->status |= NODE_STATUS_REMOVE;
+    pcache_node_unref(node);
+}
+
+static RBCacheNode *pcache_node_alloc(uint64_t offset, uint64_t bytes,
+                                      void *opaque)
+{
+    PCacheNode *node = g_malloc(sizeof(*node));
+
+    node->data = g_malloc(bytes);
+    node->status = NODE_STATUS_NEW;
+    node->ref = 1;
+
+    return &node->common;
+}
+
+#define PCACHE_STEPS_FORWARD 2
+
+static PCacheNode *get_readahead_node(BlockDriverState *bs, RBCache *rbcache,
+                                      uint64_t offset, uint64_t bytes)
+{
+    uint32_t count = PCACHE_STEPS_FORWARD;
+
+    int64_t total_bytes = bdrv_getlength(bs);
+    if (total_bytes < 0) {
+        return NULL;
+    }
+
+    while (count--) {
+        PCacheNode *node;
+
+        if (total_bytes <= offset + bytes) {
+            break;
+        }
+
+        node = rbcache_search_and_insert(rbcache, offset, bytes);
+        if (node->status == NODE_STATUS_NEW) {
+            return node;
+        }
+        /* Ranges smaller than the readahead size are not cached, to reduce
+         * fragmentation of the cache. If the data is already cached, we
+         * just step over it.
+         */
+        if (offset <= node->common.offset && !count--) {
+            break;
+        }
+        offset = node->common.offset + node->common.bytes;
+    }
+
+    return NULL;
+}
+
+typedef struct PCacheReadaheadCo {
+    BlockDriverState *bs;
+    int64_t offset;
+    uint64_t bytes;
+} PCacheReadaheadCo;
+
+static void coroutine_fn pcache_co_readahead(BlockDriverState *bs,
+                                             uint64_t offset, uint64_t bytes)
+{
+    BDRVPCacheState *s = bs->opaque;
+    QEMUIOVector qiov;
+    PCacheNode *node;
+    uint64_t readahead_offset;
+    uint64_t readahead_bytes;
+    int ret;
+
+    if (!check_request_sequence(s, offset)) {
+        return;
+    }
+
+    readahead_offset = offset + bytes;
+    readahead_bytes = s->readahead_size;
+
+    node = get_readahead_node(bs, s->cache, readahead_offset, readahead_bytes);
+    if (node == NULL) {
+        return;
+    }
+    node->status = NODE_STATUS_INFLIGHT;
+    qemu_iovec_init(&qiov, 1);
+    qemu_iovec_add(&qiov, node->data, node->common.bytes);
+    pcache_node_ref(node);
+
+    ret = bdrv_co_preadv(bs->file, node->common.offset,
+                         node->common.bytes, &qiov, 0);
+    assert(node->status & NODE_STATUS_INFLIGHT);
+    node->status &= ~NODE_STATUS_INFLIGHT;
+    node->status |= NODE_STATUS_COMPLETED;
+
+    if (ret < 0) {
+        rbcache_remove(s->cache, &node->common);
+    }
+    pcache_node_unref(node);
+}
+
+static void pcache_readahead_entry(void *opaque)
+{
+    PCacheReadaheadCo *readahead_co = opaque;
+
+    pcache_co_readahead(readahead_co->bs, readahead_co->offset,
+                        readahead_co->bytes);
+}
+
 static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
 {
     BDRVPCacheState *s = bs->opaque;
+    PCacheReadaheadCo readahead_co;
+    Coroutine *co;
 
-    if (s->max_aio_size >= bytes) {
-        update_req_stats(s->req_stats, offset, bytes);
+    if (bytes > s->max_aio_size) {
+        goto skip_large_request;
     }
 
+    update_req_stats(s->req_stats, offset, bytes);
+
+    readahead_co = (PCacheReadaheadCo) {
+        .bs = bs,
+        .offset = offset,
+        .bytes = bytes,
+    };
+    co = qemu_coroutine_create(pcache_readahead_entry, &readahead_co);
+    qemu_coroutine_enter(co);
+
+skip_large_request:
     return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
 }
 
@@ -104,10 +299,16 @@ static void pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
 {
     uint64_t stats_size = qemu_opt_get_size(opts, PCACHE_OPT_STATS_SIZE,
                                             PCACHE_DEFAULT_STATS_SIZE);
+    uint64_t cache_size = qemu_opt_get_size(opts, PCACHE_OPT_CACHE_SIZE,
+                                            PCACHE_DEFAULT_CACHE_SIZE);
     s->req_stats = rbcache_create(NULL, NULL, stats_size, RBCACHE_FIFO, s);
 
     s->max_aio_size = qemu_opt_get_size(opts, PCACHE_OPT_MAX_AIO_SIZE,
                                         PCACHE_DEFAULT_MAX_AIO_SIZE);
+    s->cache = rbcache_create(pcache_node_alloc, pcache_node_free, cache_size,
+                              RBCACHE_LRU, s);
+    s->readahead_size = qemu_opt_get_size(opts, PCACHE_OPT_READAHEAD_SIZE,
+                                          PCACHE_DEFAULT_READAHEAD_SIZE);
 }
 
 static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
@@ -144,6 +345,7 @@ static void pcache_close(BlockDriverState *bs)
     BDRVPCacheState *s = bs->opaque;
 
     rbcache_destroy(s->req_stats);
+    rbcache_destroy(s->cache);
 }
 
 static int64_t pcache_getlength(BlockDriverState *bs)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 09/18] block/pcache: skip readahead for unallocated clusters
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (7 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 08/18] block/pcache: add AIO readahead Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 10/18] block/pcache: cache invalidation on write requests Pavel Butsykin
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

Typically, data for unallocated clusters is filled with zeros, so it makes no
sense to store it in the cache.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/block/pcache.c b/block/pcache.c
index 57eebd434a..f30e9e7bfe 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -174,6 +174,28 @@ static RBCacheNode *pcache_node_alloc(uint64_t offset, uint64_t bytes,
     return &node->common;
 }
 
+static bool check_allocated_clusters(BlockDriverState *bs, uint64_t offset,
+                                     uint64_t bytes)
+{
+    int64_t sector_num = offset >> BDRV_SECTOR_BITS;
+    int32_t nb_sectors = bytes >> BDRV_SECTOR_BITS;
+
+    assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+    assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+
+    do {
+        int num, ret = bdrv_is_allocated(bs, sector_num, nb_sectors, &num);
+        if (ret <= 0) {
+            return false;
+        }
+        sector_num += num;
+        nb_sectors -= num;
+
+    } while (nb_sectors);
+
+    return true;
+}
+
 #define PCACHE_STEPS_FORWARD 2
 
 static PCacheNode *get_readahead_node(BlockDriverState *bs, RBCache *rbcache,
@@ -193,6 +215,10 @@ static PCacheNode *get_readahead_node(BlockDriverState *bs, RBCache *rbcache,
             break;
         }
 
+        if (!check_allocated_clusters(bs, offset, bytes)) {
+            break;
+        }
+
         node = rbcache_search_and_insert(rbcache, offset, bytes);
         if (node->status == NODE_STATUS_NEW) {
             return node;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 10/18] block/pcache: cache invalidation on write requests
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (8 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 09/18] block/pcache: skip readahead for unallocated clusters Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 11/18] block/pcache: add reading data from the cache Pavel Butsykin
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

On completion of an AIO write request we simply drop all completed cache
nodes intersecting the written range; this is a simple way to keep the cache
up-to-date. Nodes whose readahead is still in flight are skipped, as shown
in the sketch below.
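
For example (hypothetical cache contents):

    /* cached: [0, 8192) completed, [8192, 16384) in flight;
     * a guest write to [4096, 12288) has just completed: */
    pcache_invalidation(bs, 4096, 8192);
    /* [0, 8192) is removed; [8192, 16384) is skipped because the
     * readahead filling it has not completed yet */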

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/block/pcache.c b/block/pcache.c
index f30e9e7bfe..4007241d37 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -314,11 +314,41 @@ skip_large_request:
     return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
 }
 
+static void pcache_invalidation(BlockDriverState *bs, uint64_t offset,
+                                uint64_t bytes)
+{
+    BDRVPCacheState *s = bs->opaque;
+    uint64_t chunk_offset = offset, chunk_bytes = bytes;
+    uint64_t end_offs = offset + bytes;
+
+    do {
+        PCacheNode *node = rbcache_search(s->cache, chunk_offset, chunk_bytes);
+        if (node == NULL) {
+            break;
+        }
+        assert(node->status == NODE_STATUS_COMPLETED ||
+               node->status == NODE_STATUS_INFLIGHT);
+
+        chunk_offset = node->common.offset + node->common.bytes;
+        chunk_bytes = end_offs - chunk_offset;
+
+        if (node->status & NODE_STATUS_COMPLETED) {
+            rbcache_remove(s->cache, &node->common);
+        }
+    } while (end_offs > chunk_offset);
+}
+
 static coroutine_fn int pcache_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                           uint64_t bytes, QEMUIOVector *qiov,
                                           int flags)
 {
-    return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
+    int ret = bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
+    if (ret < 0) {
+        return ret;
+    }
+    pcache_invalidation(bs, offset, bytes);
+
+    return ret;
 }
 
 static void pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 11/18] block/pcache: add reading data from the cache
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (9 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 10/18] block/pcache: cache invalidation on write requests Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 12/18] block/pcache: inflight readahead request waiting for read Pavel Butsykin
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

Add read_cache_data_direct(), which copies data from a cache node into a
qiov. For now it is used only in the simplest case: the node is not in
flight and fully covers the read request.
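
The offset arithmetic of read_cache_data_direct() with hypothetical numbers
(a partial overlap is shown for generality, although this patch only calls
it for nodes that fully cover the request):

    /* node covers [8192, 16384), request reads [4096, 12288):
     *   qiov_offs = 8192 - 4096 = 4096  (the node starts inside the request)
     *   node_offs = 0
     *   size      = min(12288, 16384) - max(4096, 8192) = 4096
     * so 4096 bytes of node->data are copied into the qiov at offset 4096 */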

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/block/pcache.c b/block/pcache.c
index 4007241d37..e3749679ca 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -92,6 +92,30 @@ static void pcache_node_unref(PCacheNode *node)
     }
 }
 
+static uint64_t ranges_overlap_size(uint64_t offset1, uint64_t size1,
+                                    uint64_t offset2, uint64_t size2)
+{
+    return MIN(offset1 + size1, offset2 + size2) - MAX(offset1, offset2);
+}
+
+static void read_cache_data_direct(PCacheNode *node, uint64_t offset,
+                                   uint64_t bytes, QEMUIOVector *qiov)
+{
+    uint64_t qiov_offs = 0, node_offs = 0;
+    uint64_t size;
+    uint64_t copy;
+
+    if (offset < node->common.offset) {
+        qiov_offs = node->common.offset - offset;
+    } else {
+        node_offs = offset - node->common.offset;
+    }
+    size = ranges_overlap_size(offset, bytes, node->common.offset,
+                               node->common.bytes);
+    copy = qemu_iovec_from_buf(qiov, qiov_offs, node->data + node_offs, size);
+    assert(copy == size);
+}
+
 static void update_req_stats(RBCache *rbcache, uint64_t offset, uint64_t bytes)
 {
     do {
@@ -288,6 +312,32 @@ static void pcache_readahead_entry(void *opaque)
                         readahead_co->bytes);
 }
 
+enum {
+    CACHE_MISS,
+    CACHE_HIT,
+};
+
+static int pcache_lookup_data(BlockDriverState *bs, uint64_t offset,
+                              uint64_t bytes, QEMUIOVector *qiov)
+{
+    BDRVPCacheState *s = bs->opaque;
+
+    PCacheNode *node = rbcache_search(s->cache, offset, bytes);
+    if (node == NULL || node->status & NODE_STATUS_INFLIGHT) {
+        return CACHE_MISS;
+    }
+
+    /* Node covers the whole request */
+    if (node->common.offset <= offset &&
+        node->common.offset + node->common.bytes >= offset + bytes)
+    {
+        read_cache_data_direct(node, offset, bytes, qiov);
+        return CACHE_HIT;
+    }
+
+    return CACHE_MISS;
+}
+
 static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
                                          uint64_t bytes, QEMUIOVector *qiov,
                                          int flags)
@@ -295,6 +345,7 @@ static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
     BDRVPCacheState *s = bs->opaque;
     PCacheReadaheadCo readahead_co;
     Coroutine *co;
+    int status;
 
     if (bytes > s->max_aio_size) {
         goto skip_large_request;
@@ -310,6 +361,11 @@ static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
     co = qemu_coroutine_create(pcache_readahead_entry, &readahead_co);
     qemu_coroutine_enter(co);
 
+    status = pcache_lookup_data(bs, offset, bytes, qiov);
+    if (status == CACHE_HIT) {
+        return 0;
+    }
+
 skip_large_request:
     return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 12/18] block/pcache: inflight readahead request waiting for read
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (10 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 11/18] block/pcache: add reading data from the cache Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 13/18] block/pcache: write through Pavel Butsykin
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

If there is a cache hit but the node status is NODE_STATUS_INFLIGHT, a
readahead request for this node is still in flight. In this case we can wait
for the readahead to complete and then copy the data. This lets us optimize
aio read requests even further.
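
Roughly, the handshake between the two coroutines looks like this (a sketch
of the two sides as implemented in this patch):

    /* reader (read_cache_data):
     *   queue itself on node->wait_list and yield;
     * readahead (pcache_co_readahead), after bdrv_co_preadv() returns:
     *   for each waiter: set rlink->ret and re-enter its coroutine;
     * the reader then resumes and copies the data only if rlink.ret >= 0 */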

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/block/pcache.c b/block/pcache.c
index e3749679ca..c1cbfa7040 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -62,9 +62,16 @@ typedef struct BDRVPCacheState {
     uint64_t readahead_size;
 } BDRVPCacheState;
 
+typedef struct ReqLinkEntry {
+    QTAILQ_ENTRY(ReqLinkEntry) entry;
+    Coroutine *co;
+    int ret;
+} ReqLinkEntry;
+
 typedef struct PCacheNode {
     RBCacheNode common;
     uint8_t *data;
+    QTAILQ_HEAD(, ReqLinkEntry) wait_list;
     enum {
         NODE_STATUS_NEW       = 0x01,
         NODE_STATUS_INFLIGHT  = 0x02,
@@ -116,6 +123,27 @@ static void read_cache_data_direct(PCacheNode *node, uint64_t offset,
     assert(copy == size);
 }
 
+static int read_cache_data(PCacheNode *node, uint64_t offset, uint64_t bytes,
+                           QEMUIOVector *qiov)
+{
+    if (node->status & NODE_STATUS_INFLIGHT) {
+        ReqLinkEntry rlink = {
+            .co = qemu_coroutine_self(),
+        };
+
+        QTAILQ_INSERT_HEAD(&node->wait_list, &rlink, entry);
+
+        qemu_coroutine_yield();
+
+        if (rlink.ret < 0) {
+            return rlink.ret;
+        }
+    }
+    read_cache_data_direct(node, offset, bytes, qiov);
+
+    return 0;
+}
+
 static void update_req_stats(RBCache *rbcache, uint64_t offset, uint64_t bytes)
 {
     do {
@@ -194,6 +222,7 @@ static RBCacheNode *pcache_node_alloc(uint64_t offset, uint64_t bytes,
     node->data = g_malloc(bytes);
     node->status = NODE_STATUS_NEW;
     node->ref = 1;
+    QTAILQ_INIT(&node->wait_list);
 
     return &node->common;
 }
@@ -272,6 +301,7 @@ static void coroutine_fn pcache_co_readahead(BlockDriverState *bs,
     BDRVPCacheState *s = bs->opaque;
     QEMUIOVector qiov;
     PCacheNode *node;
+    ReqLinkEntry *rlink, *next;
     uint64_t readahead_offset;
     uint64_t readahead_bytes;
     int ret;
@@ -301,6 +331,13 @@ static void coroutine_fn pcache_co_readahead(BlockDriverState *bs,
     if (ret < 0) {
         rbcache_remove(s->cache, &node->common);
     }
+
+    QTAILQ_FOREACH_SAFE(rlink, &node->wait_list, entry, next) {
+        QTAILQ_REMOVE(&node->wait_list, rlink, entry);
+        rlink->ret = ret;
+        qemu_coroutine_enter(rlink->co);
+    }
+
     pcache_node_unref(node);
 }
 
@@ -323,7 +360,7 @@ static int pcache_lookup_data(BlockDriverState *bs, uint64_t offset,
     BDRVPCacheState *s = bs->opaque;
 
     PCacheNode *node = rbcache_search(s->cache, offset, bytes);
-    if (node == NULL || node->status & NODE_STATUS_INFLIGHT) {
+    if (node == NULL) {
         return CACHE_MISS;
     }
 
@@ -331,7 +368,10 @@ static int pcache_lookup_data(BlockDriverState *bs, uint64_t offset,
     if (node->common.offset <= offset &&
         node->common.offset + node->common.bytes >= offset + bytes)
     {
-        read_cache_data_direct(node, offset, bytes, qiov);
+        int ret = read_cache_data(node, offset, bytes, qiov);
+        if (ret < 0) {
+            return ret;
+        }
         return CACHE_HIT;
     }
 
@@ -362,8 +402,8 @@ static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
     qemu_coroutine_enter(co);
 
     status = pcache_lookup_data(bs, offset, bytes, qiov);
-    if (status == CACHE_HIT) {
-        return 0;
+    if (status != CACHE_MISS) {
+        return status < 0 ? status : 0;
     }
 
 skip_large_request:
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 13/18] block/pcache: write through
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (11 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 12/18] block/pcache: inflight readahead request waiting for read Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 14/18] block/pcache: up-to-date cache for removed nodes Pavel Butsykin
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

Write-through is another way to keep the cache up-to-date. Even though this
case should be rare, writing the data through to the cache is cheaper than
dropping the cached node.
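
As a hypothetical example of the overlap arithmetic in write_cache_data()
below: for a cached node covering bytes [4096, 8192) and a guest write
covering [0, 6144),

    /* offset = 0, bytes = 6144; node: offset = 4096, bytes = 4096 */
    qiov_offs = node->common.offset - offset;           /* 4096 */
    node_offs = 0;
    size = ranges_overlap_size(0, 6144, 4096, 4096);    /* 2048 */
    /* 2048 bytes are copied from qiov[4096..6144)
     * into node->data[0..2048) */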

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/block/pcache.c b/block/pcache.c
index c1cbfa7040..140f90d6d7 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -410,8 +410,26 @@ skip_large_request:
     return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
 }
 
-static void pcache_invalidation(BlockDriverState *bs, uint64_t offset,
-                                uint64_t bytes)
+static void write_cache_data(PCacheNode *node, uint64_t offset, uint64_t bytes,
+                             QEMUIOVector *qiov)
+{
+    uint64_t qiov_offs = 0, node_offs = 0;
+    uint64_t size;
+    uint64_t copy;
+
+    if (offset < node->common.offset) {
+        qiov_offs = node->common.offset - offset;
+    } else {
+        node_offs = offset - node->common.offset;
+    }
+    size = ranges_overlap_size(offset, bytes, node->common.offset,
+                               node->common.bytes);
+    copy = qemu_iovec_to_buf(qiov, qiov_offs, node->data + node_offs, size);
+    assert(copy == size);
+}
+
+static void pcache_write_through(BlockDriverState *bs, uint64_t offset,
+                                 uint64_t bytes, QEMUIOVector *qiov)
 {
     BDRVPCacheState *s = bs->opaque;
     uint64_t chunk_offset = offset, chunk_bytes = bytes;
@@ -429,7 +447,7 @@ static void pcache_invalidation(BlockDriverState *bs, uint64_t offset,
         chunk_bytes = end_offs - chunk_offset;
 
         if (node->status & NODE_STATUS_COMPLETED) {
-            rbcache_remove(s->cache, &node->common);
+            write_cache_data(node, offset, bytes, qiov);
         }
     } while (end_offs > chunk_offset);
 }
@@ -442,7 +460,7 @@ static coroutine_fn int pcache_co_pwritev(BlockDriverState *bs, uint64_t offset,
     if (ret < 0) {
         return ret;
     }
-    pcache_invalidation(bs, offset, bytes);
+    pcache_write_through(bs, offset, bytes, qiov);
 
     return ret;
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 14/18] block/pcache: up-to-date cache for removed nodes
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (12 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 13/18] block/pcache: write through Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 15/18] block/pcache: pick up parts of the cache Pavel Butsykin
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

We must keep the data of removed nodes up-to-date. This is necessary when the
reading of a node has to be deferred, because an overlapping write may arrive
for a removed node in the interval between the completion of the node and the
reading of the node.

For this purpose, removed nodes are moved to remove_node_list and node->data
is updated on each overlapping write.

This is a preparatory patch for the handling of partial cache hits.
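
A hypothetical timeline illustrates why a removed node has to stay writable:

    /* t0: readahead of node [100..120) completes; a deferred reader
     *     still holds a reference but has not copied the data yet
     * t1: the node is evicted from the rb-tree and moved to
     *     remove_node_list (NODE_STATUS_REMOVE)
     * t2: a guest write to [110..115) arrives; the write-through code
     *     also patches node->data of the removed node
     * t3: the deferred reader copies from node->data and sees the data
     *     written at t2 instead of the stale readahead contents
     */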

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/block/pcache.c b/block/pcache.c
index 140f90d6d7..86b28de44b 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -60,6 +60,7 @@ typedef struct BDRVPCacheState {
     RBCache *cache;
     uint64_t max_aio_size;
     uint64_t readahead_size;
+    QLIST_HEAD(, PCacheNode) remove_node_list;
 } BDRVPCacheState;
 
 typedef struct ReqLinkEntry {
@@ -79,6 +80,7 @@ typedef struct PCacheNode {
         NODE_STATUS_REMOVE    = 0x08,
         NODE_STATUS_DELETED   = 0x10, /* only for debugging */
     } status;
+    QLIST_ENTRY(PCacheNode) entry;
     int ref;
 } PCacheNode;
 
@@ -94,6 +96,7 @@ static void pcache_node_unref(PCacheNode *node)
         assert(node->status & NODE_STATUS_REMOVE);
         node->status |= NODE_STATUS_DELETED;
 
+        QLIST_REMOVE(node, entry);
         g_free(node->data);
         g_free(node);
     }
@@ -205,11 +208,14 @@ static bool check_request_sequence(BDRVPCacheState *s, uint64_t offset)
 
 static void pcache_node_free(RBCacheNode *rbnode, void *opaque)
 {
+    BDRVPCacheState *s = opaque;
     PCacheNode *node = container_of(rbnode, PCacheNode, common);
 
     assert(node->status == NODE_STATUS_INFLIGHT ||
            node->status == NODE_STATUS_COMPLETED);
 
+    QLIST_INSERT_HEAD(&s->remove_node_list, node, entry);
+
     node->status |= NODE_STATUS_REMOVE;
     pcache_node_unref(node);
 }
@@ -432,11 +438,12 @@ static void pcache_write_through(BlockDriverState *bs, uint64_t offset,
                                  uint64_t bytes, QEMUIOVector *qiov)
 {
     BDRVPCacheState *s = bs->opaque;
+    PCacheNode *node, *next;
     uint64_t chunk_offset = offset, chunk_bytes = bytes;
     uint64_t end_offs = offset + bytes;
 
     do {
-        PCacheNode *node = rbcache_search(s->cache, chunk_offset, chunk_bytes);
+        node = rbcache_search(s->cache, chunk_offset, chunk_bytes);
         if (node == NULL) {
             break;
         }
@@ -450,6 +457,18 @@ static void pcache_write_through(BlockDriverState *bs, uint64_t offset,
             write_cache_data(node, offset, bytes, qiov);
         }
     } while (end_offs > chunk_offset);
+
+    QLIST_FOREACH_SAFE(node, &s->remove_node_list, entry, next) {
+        if (node->status & NODE_STATUS_INFLIGHT) {
+            continue;
+        }
+        if (offset >= node->common.offset + node->common.bytes ||
+            offset + bytes <= node->common.offset)
+        {
+            continue;
+        }
+        write_cache_data(node, offset, bytes, qiov);
+    }
 }
 
 static coroutine_fn int pcache_co_pwritev(BlockDriverState *bs, uint64_t offset,
@@ -479,6 +498,7 @@ static void pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
                               RBCACHE_LRU, s);
     s->readahead_size = qemu_opt_get_size(opts, PCACHE_OPT_READAHEAD_SIZE,
                                           PCACHE_DEFAULT_READAHEAD_SIZE);
+    QLIST_INIT(&s->remove_node_list);
 }
 
 static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
@@ -516,6 +536,8 @@ static void pcache_close(BlockDriverState *bs)
 
     rbcache_destroy(s->req_stats);
     rbcache_destroy(s->cache);
+
+    assert(QLIST_EMPTY(&s->remove_node_list));
 }
 
 static int64_t pcache_getlength(BlockDriverState *bs)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 15/18] block/pcache: pick up parts of the cache
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (13 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 14/18] block/pcache: up-to-date cache for removed nodes Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 16/18] block/pcache: drop used pcache nodes Pavel Butsykin
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

Provided that the request size is no larger than the readahead size, a partial
cache hit can occur in the following three cases:
1. The request covers the bottom part of the node
2. The request covers the upper part of the node
3. The request is between two nodes and partially covers both of them

The function pickup_parts_of_cache() covers all three cases and creates a new
request for each uncached part of the original request.
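
As an illustration with hypothetical offsets, assume one cached node covering
[100, 120) (and, for case 3, a second node covering [140, 160)):

    /* 1. request [ 90, 110) -> new request [90,100)  + hit [100,110)
     * 2. request [110, 130) -> hit [110,120)         + new request [120,130)
     * 3. request [110, 150) -> hit [110,120) + new request [120,140)
     *                          + hit [140,150)
     */

Each fragment is either copied from a node or fetched by a new request, so a
request never splits into more than 3 parts (PCACHE_MAX_FRAGMENT_NUM).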

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 174 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 166 insertions(+), 8 deletions(-)

diff --git a/block/pcache.c b/block/pcache.c
index 86b28de44b..3f89e40b05 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -10,6 +10,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/error-report.h"
 #include "block/block_int.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qstring.h"
@@ -355,6 +356,153 @@ static void pcache_readahead_entry(void *opaque)
                         readahead_co->bytes);
 }
 
+/*
+ * Provided that the request size is less than or equal to the readahead size,
+ * a partial cache hit can occur in the following three cases:
+ * 1. The request covers the bottom part of the node.
+ * 2. The request covers the upper part of the node.
+ * 3. The request is between two nodes and partially covers both of them.
+ *
+ * Therefore, the request can be divided into no more than 3 parts.
+ */
+#define PCACHE_MAX_FRAGMENT_NUM 3
+
+typedef struct PartReqEntry {
+    ReqLinkEntry rlink;
+    PCacheNode *node;
+} PartReqEntry;
+
+typedef struct PartReqDesc {
+    Coroutine *co;
+    PartReqEntry parts[PCACHE_MAX_FRAGMENT_NUM];
+    uint32_t cnt;
+    uint32_t completed;
+} PartReqDesc;
+
+typedef struct PCachePartReqCo {
+    BlockDriverState *bs;
+    uint64_t offset;
+    uint64_t bytes;
+    PartReqDesc *desc;
+} PCachePartReqCo;
+
+static void coroutine_fn pcache_co_part_req(BlockDriverState *bs,
+                                            uint64_t offset, uint64_t bytes,
+                                            PartReqDesc *req)
+{
+    BDRVPCacheState *s = bs->opaque;
+    QEMUIOVector qiov;
+    PartReqEntry *part = &req->parts[req->cnt];
+    PCacheNode *node = container_of(rbcache_node_alloc(s->cache, offset, bytes),
+                                    PCacheNode, common);
+    node->status = NODE_STATUS_INFLIGHT;
+    qemu_iovec_init(&qiov, 1);
+    qemu_iovec_add(&qiov, node->data, node->common.bytes);
+
+    part->node = node;
+    assert(++req->cnt <= PCACHE_MAX_FRAGMENT_NUM);
+
+    part->rlink.ret = bdrv_co_preadv(bs->file, offset, bytes, &qiov, 0);
+
+    node->status = NODE_STATUS_COMPLETED | NODE_STATUS_REMOVE;
+    QLIST_INSERT_HEAD(&s->remove_node_list, node, entry);
+
+    if (!qemu_coroutine_entered(req->co)) {
+        qemu_coroutine_enter(req->co);
+    } else {
+        req->completed++;
+    }
+}
+
+static void pcache_part_req_entry(void *opaque)
+{
+    PCachePartReqCo *req_co = opaque;
+
+    pcache_co_part_req(req_co->bs, req_co->offset, req_co->bytes, req_co->desc);
+}
+
+static int pickup_parts_of_cache(BlockDriverState *bs, PCacheNode *node,
+                                 uint64_t offset, uint64_t bytes,
+                                 QEMUIOVector *qiov)
+{
+    BDRVPCacheState *s = bs->opaque;
+    PartReqDesc req = {
+        .co = qemu_coroutine_self(),
+    };
+    PCachePartReqCo req_co = {
+        .bs = bs,
+        .desc = &req
+    };
+    uint64_t chunk_offset = offset, chunk_bytes = bytes;
+    uint64_t up_size;
+    int ret = 0;
+
+    do {
+        pcache_node_ref(node);
+
+        if (chunk_offset < node->common.offset) {
+            Coroutine *co;
+
+            req_co.offset = chunk_offset;
+            req_co.bytes = up_size = node->common.offset - chunk_offset;
+
+            co = qemu_coroutine_create(pcache_part_req_entry, &req_co);
+            qemu_coroutine_enter(co);
+
+            chunk_offset += up_size;
+            chunk_bytes -= up_size;
+        }
+
+        req.parts[req.cnt].node = node;
+        if (node->status & NODE_STATUS_INFLIGHT) {
+            req.parts[req.cnt].rlink.co = qemu_coroutine_self();
+            QTAILQ_INSERT_HEAD(&node->wait_list,
+                               &req.parts[req.cnt].rlink, entry);
+        } else {
+            req.completed++;
+        }
+        assert(++req.cnt <= PCACHE_MAX_FRAGMENT_NUM);
+
+        up_size = MIN(node->common.offset + node->common.bytes - chunk_offset,
+                      chunk_bytes);
+        chunk_bytes -= up_size;
+        chunk_offset += up_size;
+
+        if (chunk_bytes != 0) {
+            node = rbcache_search(s->cache, chunk_offset, chunk_bytes);
+            if (node == NULL) {
+                Coroutine *co;
+
+                req_co.offset = chunk_offset;
+                req_co.bytes = chunk_bytes;
+
+                co = qemu_coroutine_create(pcache_part_req_entry, &req_co);
+                qemu_coroutine_enter(co);
+                chunk_bytes = 0;
+            }
+        }
+    } while (chunk_bytes != 0);
+
+    while (req.completed < req.cnt) {
+        qemu_coroutine_yield();
+        req.completed++;
+    }
+
+    while (req.cnt--) {
+        PartReqEntry *part = &req.parts[req.cnt];
+        if (ret == 0) {
+            if (part->rlink.ret == 0) {
+                read_cache_data_direct(part->node, offset, bytes, qiov);
+            } else {
+                ret = part->rlink.ret;
+            }
+        }
+        pcache_node_unref(part->node);
+    }
+
+    return ret;
+}
+
 enum {
     CACHE_MISS,
     CACHE_HIT,
@@ -364,6 +512,7 @@ static int pcache_lookup_data(BlockDriverState *bs, uint64_t offset,
                               uint64_t bytes, QEMUIOVector *qiov)
 {
     BDRVPCacheState *s = bs->opaque;
+    int ret;
 
     PCacheNode *node = rbcache_search(s->cache, offset, bytes);
     if (node == NULL) {
@@ -374,14 +523,16 @@ static int pcache_lookup_data(BlockDriverState *bs, uint64_t offset,
     if (node->common.offset <= offset &&
         node->common.offset + node->common.bytes >= offset + bytes)
     {
-        int ret = read_cache_data(node, offset, bytes, qiov);
-        if (ret < 0) {
-            return ret;
-        }
-        return CACHE_HIT;
+        ret = read_cache_data(node, offset, bytes, qiov);
+
+    } else {
+        ret = pickup_parts_of_cache(bs, node, offset, bytes, qiov);
     }
 
-    return CACHE_MISS;
+    if (ret < 0) {
+        return ret;
+    }
+    return CACHE_HIT;
 }
 
 static coroutine_fn int pcache_co_preadv(BlockDriverState *bs, uint64_t offset,
@@ -484,7 +635,7 @@ static coroutine_fn int pcache_co_pwritev(BlockDriverState *bs, uint64_t offset,
     return ret;
 }
 
-static void pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
+static int pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
 {
     uint64_t stats_size = qemu_opt_get_size(opts, PCACHE_OPT_STATS_SIZE,
                                             PCACHE_DEFAULT_STATS_SIZE);
@@ -499,6 +650,13 @@ static void pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
     s->readahead_size = qemu_opt_get_size(opts, PCACHE_OPT_READAHEAD_SIZE,
                                           PCACHE_DEFAULT_READAHEAD_SIZE);
     QLIST_INIT(&s->remove_node_list);
+
+    if (s->readahead_size < s->max_aio_size) {
+        error_report("Readahead size can't be less than the maximum request "
+                     "size that can be handled by pcache");
+        return -ENOTSUP;
+    }
+    return 0;
 }
 
 static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
@@ -524,7 +682,7 @@ static int pcache_file_open(BlockDriverState *bs, QDict *options, int flags,
         error_propagate(errp, local_err);
         goto fail;
     }
-    pcache_state_init(opts, bs->opaque);
+    ret = pcache_state_init(opts, bs->opaque);
 fail:
     qemu_opts_del(opts);
     return ret;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 16/18] block/pcache: drop used pcache nodes
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (14 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 15/18] block/pcache: pick up parts of the cache Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 17/18] qapi: allow blockdev-add for pcache Pavel Butsykin
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru

The pcache is aimed at one particular access pattern: sequential reads. This
allows us to drop parts of the cache that have already been read, which
reduces the cache size and the number of evicted nodes.
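
The dropping condition is a per-node counter of bytes already handed out to
readers; a sketch of the check added to read_cache_data_direct():

    node->rdcnt += size;      /* bytes read from this node so far */
    if (node->rdcnt >= node->common.bytes &&
        !(node->status & NODE_STATUS_REMOVE))
    {
        /* the node has been consumed in full; a repeated read of the
         * same range is unlikely for sequential streams, so free the
         * cache space at once instead of waiting for LRU eviction */
        rbcache_remove(s->cache, &node->common);
    }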

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/block/pcache.c b/block/pcache.c
index 3f89e40b05..6d2b54cf78 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -81,6 +81,7 @@ typedef struct PCacheNode {
         NODE_STATUS_REMOVE    = 0x08,
         NODE_STATUS_DELETED   = 0x10, /* only for debugging */
     } status;
+    uint64_t rdcnt;
     QLIST_ENTRY(PCacheNode) entry;
     int ref;
 } PCacheNode;
@@ -109,13 +110,17 @@ static uint64_t ranges_overlap_size(uint64_t offset1, uint64_t size1,
     return MIN(offset1 + size1, offset2 + size2) - MAX(offset1, offset2);
 }
 
-static void read_cache_data_direct(PCacheNode *node, uint64_t offset,
-                                   uint64_t bytes, QEMUIOVector *qiov)
+static void read_cache_data_direct(BlockDriverState *bs, PCacheNode *node,
+                                   uint64_t offset, uint64_t bytes,
+                                   QEMUIOVector *qiov)
 {
+    BDRVPCacheState *s = bs->opaque;
     uint64_t qiov_offs = 0, node_offs = 0;
     uint64_t size;
     uint64_t copy;
 
+    assert(node->status & NODE_STATUS_COMPLETED);
+
     if (offset < node->common.offset) {
         qiov_offs = node->common.offset - offset;
     } else {
@@ -124,11 +129,17 @@ static void read_cache_data_direct(PCacheNode *node, uint64_t offset,
     size = ranges_overlap_size(offset, bytes, node->common.offset,
                                node->common.bytes);
     copy = qemu_iovec_from_buf(qiov, qiov_offs, node->data + node_offs, size);
+    node->rdcnt += size;
+    if (node->rdcnt >= node->common.bytes &&
+        !(node->status & NODE_STATUS_REMOVE))
+    {
+        rbcache_remove(s->cache, &node->common);
+    }
     assert(copy == size);
 }
 
-static int read_cache_data(PCacheNode *node, uint64_t offset, uint64_t bytes,
-                           QEMUIOVector *qiov)
+static int read_cache_data(BlockDriverState *bs, PCacheNode *node,
+                           uint64_t offset, uint64_t bytes, QEMUIOVector *qiov)
 {
     if (node->status & NODE_STATUS_INFLIGHT) {
         ReqLinkEntry rlink = {
@@ -143,7 +154,7 @@ static int read_cache_data(PCacheNode *node, uint64_t offset, uint64_t bytes,
             return rlink.ret;
         }
     }
-    read_cache_data_direct(node, offset, bytes, qiov);
+    read_cache_data_direct(bs, node, offset, bytes, qiov);
 
     return 0;
 }
@@ -228,6 +239,7 @@ static RBCacheNode *pcache_node_alloc(uint64_t offset, uint64_t bytes,
 
     node->data = g_malloc(bytes);
     node->status = NODE_STATUS_NEW;
+    node->rdcnt = 0;
     node->ref = 1;
     QTAILQ_INIT(&node->wait_list);
 
@@ -492,7 +504,7 @@ static int pickup_parts_of_cache(BlockDriverState *bs, PCacheNode *node,
         PartReqEntry *part = &req.parts[req.cnt];
         if (ret == 0) {
             if (part->rlink.ret == 0) {
-                read_cache_data_direct(part->node, offset, bytes, qiov);
+                read_cache_data_direct(bs, part->node, offset, bytes, qiov);
             } else {
                 ret = part->rlink.ret;
             }
@@ -523,7 +535,7 @@ static int pcache_lookup_data(BlockDriverState *bs, uint64_t offset,
     if (node->common.offset <= offset &&
         node->common.offset + node->common.bytes >= offset + bytes)
     {
-        ret = read_cache_data(node, offset, bytes, qiov);
+        ret = read_cache_data(bs, node, offset, bytes, qiov);
 
     } else {
         ret = pickup_parts_of_cache(bs, node, offset, bytes, qiov);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 17/18] qapi: allow blockdev-add for pcache
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (15 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 16/18] block/pcache: drop used pcache nodes Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 18/18] block/pcache: add tracepoints Pavel Butsykin
  2017-01-25 16:50 ` [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Denis V. Lunev
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru
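
This makes the pcache driver available via blockdev-add. A hypothetical
invocation (untested sketch; "drive0" names an already existing node):

    { "execute": "blockdev-add",
      "arguments": { "driver": "pcache",
                     "node-name": "pcache0",
                     "image": "drive0",
                     "pcache-readahead-size": 131072,
                     "pcache-full-size": 4194304 } }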

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 qapi/block-core.json | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 6b42216960..00a6e15db3 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1719,6 +1719,7 @@
 # @nfs: Since 2.8
 # @replication: Since 2.8
 # @ssh: Since 2.8
+# @pcache: Since 2.9
 #
 # Since: 2.0
 ##
@@ -1726,8 +1727,8 @@
   'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
             'dmg', 'file', 'ftp', 'ftps', 'gluster', 'host_cdrom',
             'host_device', 'http', 'https', 'luks', 'nbd', 'nfs', 'null-aio',
-            'null-co', 'parallels', 'qcow', 'qcow2', 'qed', 'quorum', 'raw',
-            'replication', 'ssh', 'vdi', 'vhdx', 'vmdk', 'vpc',
+            'null-co', 'parallels', 'pcache', 'qcow', 'qcow2', 'qed', 'quorum',
+            'raw', 'replication', 'ssh', 'vdi', 'vhdx', 'vmdk', 'vpc',
             'vvfat' ] }
 
 ##
@@ -1808,6 +1809,30 @@
   'base': 'BlockdevOptionsGenericFormat',
   'data': { '*key-secret': 'str' } }
 
+##
+# @BlockdevOptionsPCache:
+#
+# Driver specific block device options for pcache.
+#
+# @image:                Reference to a block device image for caching.
+#
+# @pcache-stats-size:    #optional Total size of the request statistics pool.
+#
+# @pcache-max-aio-size:  #optional Maximum size of a read request that is
+#                        handled by pcache.
+#
+# @pcache-full-size:     #optional Total cache size.
+#
+# @pcache-readahead-size: #optional Prefetch cache readahead size.
+#
+# Since: 2.9
+##
+{ 'struct': 'BlockdevOptionsPCache',
+  'data': { 'image': 'BlockdevRef',
+            '*pcache-stats-size': 'int',
+            '*pcache-max-aio-size': 'int',
+            '*pcache-full-size': 'int',
+            '*pcache-readahead-size': 'int' } }
 
 ##
 # @BlockdevOptionsGenericCOWFormat:
@@ -2402,6 +2427,7 @@
       'null-aio':   'BlockdevOptionsNull',
       'null-co':    'BlockdevOptionsNull',
       'parallels':  'BlockdevOptionsGenericFormat',
+      'pcache':     'BlockdevOptionsPCache',
       'qcow2':      'BlockdevOptionsQcow2',
       'qcow':       'BlockdevOptionsGenericCOWFormat',
       'qed':        'BlockdevOptionsGenericCOWFormat',
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH v2 18/18] block/pcache: add tracepoints
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (16 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 17/18] qapi: allow blockdev-add for pcache Pavel Butsykin
@ 2016-12-30 14:31 ` Pavel Butsykin
  2017-01-25 16:50 ` [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Denis V. Lunev
  18 siblings, 0 replies; 20+ messages in thread
From: Pavel Butsykin @ 2016-12-30 14:31 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: kwolf, mreitz, den, eblake, armbru
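
The new tracepoints can be enabled like any other QEMU trace events, for
example (assuming the stock trace infrastructure):

    $ qemu-system-x86_64 -trace "pcache_*" ...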

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/pcache.c     | 20 ++++++++++++++++++++
 block/trace-events | 10 ++++++++++
 2 files changed, 30 insertions(+)

diff --git a/block/pcache.c b/block/pcache.c
index 6d2b54cf78..2bce3efc59 100644
--- a/block/pcache.c
+++ b/block/pcache.c
@@ -15,6 +15,7 @@
 #include "qapi/error.h"
 #include "qapi/qmp/qstring.h"
 #include "qemu/rbcache.h"
+#include "trace.h"
 
 #define PCACHE_OPT_STATS_SIZE "pcache-stats-size"
 #define PCACHE_OPT_MAX_AIO_SIZE "pcache-max-aio-size"
@@ -348,12 +349,18 @@ static void coroutine_fn pcache_co_readahead(BlockDriverState *bs,
     node->status |= NODE_STATUS_COMPLETED;
 
     if (ret < 0) {
+        trace_pcache_readahead_fail(ret, node->common.offset,
+                                    node->common.bytes);
         rbcache_remove(s->cache, &node->common);
     }
 
     QTAILQ_FOREACH_SAFE(rlink, &node->wait_list, entry, next) {
         QTAILQ_REMOVE(&node->wait_list, rlink, entry);
         rlink->ret = ret;
+
+        trace_pcache_readahead_pending_node_complete(ret, node->common.offset,
+            node->common.bytes, offset, bytes);
+
         qemu_coroutine_enter(rlink->co);
     }
 
@@ -509,6 +516,10 @@ static int pickup_parts_of_cache(BlockDriverState *bs, PCacheNode *node,
                 ret = part->rlink.ret;
             }
         }
+        if (part->rlink.ret < 0) {
+            trace_pcache_read_part_fail(ret, part->node->common.offset,
+                                        part->node->common.bytes);
+        }
         pcache_node_unref(part->node);
     }
 
@@ -618,6 +629,8 @@ static void pcache_write_through(BlockDriverState *bs, uint64_t offset,
 
         if (node->status & NODE_STATUS_COMPLETED) {
             write_cache_data(node, offset, bytes, qiov);
+            trace_pcache_write_through(offset, bytes, node->common.offset,
+                                       node->common.bytes);
         }
     } while (end_offs > chunk_offset);
 
@@ -631,6 +644,8 @@ static void pcache_write_through(BlockDriverState *bs, uint64_t offset,
             continue;
         }
         write_cache_data(node, offset, bytes, qiov);
+        trace_pcache_write_through_removed_node(offset, bytes,
+                node->common.offset, node->common.bytes);
     }
 }
 
@@ -663,6 +678,9 @@ static int pcache_state_init(QemuOpts *opts, BDRVPCacheState *s)
                                           PCACHE_DEFAULT_READAHEAD_SIZE);
     QLIST_INIT(&s->remove_node_list);
 
+    trace_pcache_state_init(stats_size, s->max_aio_size, cache_size,
+                            s->readahead_size);
+
     if (s->readahead_size < s->max_aio_size) {
         error_report("Readahead size can't be less than maximum request size"
                      "that can be handled by pcache");
@@ -704,6 +722,8 @@ static void pcache_close(BlockDriverState *bs)
 {
     BDRVPCacheState *s = bs->opaque;
 
+    trace_pcache_close(s->req_stats, s->cache);
+
     rbcache_destroy(s->req_stats);
     rbcache_destroy(s->cache);
 
diff --git a/block/trace-events b/block/trace-events
index cfc05f2478..d9cef3bcec 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -112,3 +112,13 @@ qed_aio_write_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s
 qed_aio_write_prefill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
 qed_aio_write_postfill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
 qed_aio_write_main(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
+
+# block/pcache.c
+pcache_readahead_fail(int ret, uint64_t offset, uint64_t bytes) "ret: %d offset: %"PRIu64" bytes: %"PRIu64
+pcache_readahead_pending_node_complete(int ret, uint64_t node_offset, uint64_t node_bytes, uint64_t read_offset, uint64_t read_bytes) "ret: %d node: %"PRIu64" %"PRIu64" pending read: %"PRIu64" %"PRIu64
+pcache_aio_read_cb_fail(int ret, uint64_t offset, uint64_t bytes) "ret: %d offset: %"PRIu64" bytes: %"PRIu64
+pcache_read_part_fail(int ret, uint64_t offset, uint64_t bytes) "ret: %d offset: %"PRIu64" bytes: %"PRIu64
+pcache_write_through(uint64_t req_offset, uint64_t req_bytes, uint64_t node_offset, uint64_t node_bytes) "request: %"PRIu64" %"PRIu64" node: %"PRIu64" %"PRIu64
+pcache_write_through_removed_node(uint64_t req_offset, uint64_t req_bytes, uint64_t node_offset, uint64_t node_bytes) "request: %"PRIu64" %"PRIu64" node: %"PRIu64" %"PRIu64
+pcache_state_init(uint64_t stats_size, uint64_t max_aio_size, uint64_t cache_size, uint64_t readahead_size) "pool statistics size: %"PRIu64" max aio size: %"PRIu64" cache size: %"PRIu64" readahead size: %"PRIu64
+pcache_close(void *req_stats, void *cache) "pool statistics: %p cache: %p"
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache
  2016-12-30 14:31 [Qemu-devel] [PATCH v2 00/18] I/O prefetch cache Pavel Butsykin
                   ` (17 preceding siblings ...)
  2016-12-30 14:31 ` [Qemu-devel] [PATCH v2 18/18] block/pcache: add tracepoints Pavel Butsykin
@ 2017-01-25 16:50 ` Denis V. Lunev
  18 siblings, 0 replies; 20+ messages in thread
From: Denis V. Lunev @ 2017-01-25 16:50 UTC (permalink / raw)
  To: Pavel Butsykin, qemu-devel, qemu-block; +Cc: kwolf, mreitz, eblake, armbru

On 12/30/2016 05:31 PM, Pavel Butsykin wrote:
> The prefetch cache aims to improve the performance of sequential read data.
> Of most interest here are the requests of a small size of data for sequential
> read, such requests can be optimized by extending them and moving into 
> the prefetch cache. However, there are 2 issues:
>  - In aggregate only a small portion of requests is sequential, so delays caused
>    by the need to read more volumes of data will lead to an overall decrease
>    in performance.
>  - The presence of redundant data in the cache memory with a large number of
>    random requests.
> This pcache implementation solves the above and other problems prefetching data.
> The pcache algorithm can be summarised by the following main steps.
>
> 1. Monitor I/O requests to identify typical sequences.
> This implementation of prefetch cache works at the storage system level and has 
> information only about the physical block addresses of I/O requests. Statistics 
> are collected only from read requests to a maximum size of 64kb(by default),
> each request that matches the criteria falls into a pool of requests. In order
> to store request statistics used by the rb-tree, it's simple but for
> this issue a quite efficient data structure.
>
> 2. Identifying sequential I/O streams.
> For each read request to be carried out attempting to lift the chain sequence 
> from the tree statistics, where this request will be element of a sequential
> chain of requests. The key to search for consecutive requests is the area of bytes 
> preceding the current request. The size of this area should not be too small to 
> avoid false readahead. The sequential stream data requests can be identified
> even when a large number of random requests. For example, if there is access to 
> the blocks 100, 1157, 27520, 4, 101, 312, 1337, 102, in the context of request
> processing 102 will be identified the chain of sequential requests 100, 101. 102
> and then should a decision be made to do readahead. Also a situation may arise
> when multiple applications A, B, C simultaneously perform sequential read of
> data. For each separate application that will be sequential read data 
> A(100, 101, 102), B(300, 301, 302), C(700, 701, 702), but for block devices it 
> may look like a random data reading: 100,300,700,101,301,701,102,302,702. 
> In this case, the sequential streams will also be recognised because location
> requests in the rb-tree will allow to separate the sequential I/O streams.
>
> 3. Do the readahead into the cache for recognized sequential data streams.
> After the issue of the detection of pcache case was resolved, need using larger 
> requests to bring data into the cache. In this implementation the pcache used
> readahead instead of the extension request, therefore the request goes as is. 
> There is not any reason to put data in the cache that will never be picked up, 
> but this will always happen in the case of extension requests. In order to store
> areas of cached blocks is also used the rb-tree, it's simple but for this issue
> a quite efficient data structure.
>
> 4. Control size of the prefetch cache pool and the requests statistic pool
> For control the border of the pool statistic of requests, the data of requests 
> are placed and replaced according to the FIFO principle, everything is simple.
> For control the boundaries of the memory cache used LRU list, it allows to limit
> the max amount memory that we can allocate for pcache. But the LRU is there
> mainly to prevent displacement of the cache blocks that was read partially. 
> The main way the memory is pushed out immediately after use, as soon as a chunk
> of memory from the cache has been completely read, since the probability of
> repetition of the request is very low. Cases when one and the same portion of
> the cache memory has been read several times are not optimized and do not apply
> to the cases that can optimize the pcache. Thus, using a cache memory of small
> volume, by the optimization of the operations read-ahead and clear memory, we
> can read entire volumes of data, providing a 100% cache hit. Also does not
> decrease the effectiveness of random read requests.
>
> PCache is implemented as a qemu block filter driver, has some configurable
> parameters, such as: total cache size, statistics size, readahead size,
> maximum size of block that can be processed.
>
> For performance evaluation has been used several test cases with different
> sequential and random read data on SSD disk. Here are the results of tests and
> qemu parameters:
>
> qemu parameters: 
> -machine pc,accel=kvm,usb=off,vmport=off -m 1024 -smp 8
> -drive file=/img/harddisk.hdd,if=none,cache=none,id=drive-scsi0-0-0-0,aio=native
> -drive driver=pcache,image=drive-scsi0-0-0-0,if=virtio
>
> *****************************************************************
> * Testcase                        * Results in iops             *
> *                                 *******************************
> *                                 *  clean qemu  *    pcache    *
> *****************************************************************
> * Create/open 16 file(s) of total * 21645 req/s  * 74793 req/s  *
> * size 2048.00 MB named           * 20385 req/s  * 66481 req/s  *
> * /tmp/tmp.tmp, start 4 thread(s) * 20616 req/s  * 69007 req/s  *
> * and do uncached sequential read *              *              *
> * by 4KB blocks                   *              *              *
> *****************************************************************
> * Create/open 16 file(s) of total * 84033 req/s  * 87828 req/s  *
> * size 2048.00 MB named           * 84602 req/s  * 89678 req/s  *
> * /tmp/tmp.tmp, start 4 thread(s) * 83163 req/s  * 96202 req/s  *
> * and do uncached sequential read *              *              *
> * by 4KB blocks with constant     *              *              *
> * queue len 32                    *              *              *
> *****************************************************************
> * Create/open 16 file(s) of total * 14104 req/s  * 14164 req/s  *
> * size 2048.00 MB named           * 14130 req/s  * 14232 req/s  *
> * /tmp/tmp.tmp, start 4 thread(s) * 14183 req/s  * 14080 req/s  *
> * and do uncached random read by  *              *              *
> * 4KB blocks                      *              *              *
> *****************************************************************
> * Create/open 16 file(s) of total * 23480 req/s  * 23483 req/s  *
> * size 2048.00 MB named           * 23070 req/s  * 22432 req/s  *
> * /tmp/tmp.tmp, start 4 thread(s) * 24090 req/s  * 23499 req/s  *
> * and do uncached random read by  *              *              *
> * 4KB blocks with constant queue  *              *              *
> * len 32                          *              *              *
> *****************************************************************
>
> Changes from v1:
> - avoid bdrv_aio_*() interfaces
> - add pcache to the QAPI schema
> - fix remarks and add more comments for rbcache
> - add more scenarios for "/rbcache/insert" test
> - fix rbcache/shrink/* tests
> - pcache: up-to-date cache for removed nodes
> - rewrite "block/pcache: pick up parts of the cache" patch
> - changed the statuses of nodes for a more flexible determination of
>   the node state
>
> Pavel Butsykin (18):
>   block/pcache: empty pcache driver filter
>   util/rbtree: add rbtree from linux kernel
>   util/rbcache: range-based cache core
>   tests/test-rbcache: add test cases
>   block/pcache: statistics collection read requests
>   block/pcache: skip large aio read
>   block/pcache: updating statistics for overlapping requests
>   block/pcache: add AIO readahead
>   block/pcache: skip readahead for unallocated clusters
>   block/pcache: cache invalidation on write requests
>   block/pcache: add reading data from the cache
>   block/pcache: inflight readahead request waiting for read
>   block/pcache: write through
>   block/pcache: up-to-date cache for removed nodes
>   block/pcache: pick up parts of the cache
>   block/pcache: drop used pcache nodes
>   qapi: allow blockdev-add for pcache
>   block/pcache: add tracepoints
>
>  MAINTAINERS                     |  13 +
>  block/Makefile.objs             |   1 +
>  block/pcache.c                  | 764 ++++++++++++++++++++++++++++++++++++++++
>  block/trace-events              |  10 +
>  include/qemu/rbcache.h          | 128 +++++++
>  include/qemu/rbtree.h           | 107 ++++++
>  include/qemu/rbtree_augmented.h | 235 ++++++++++++
>  qapi/block-core.json            |  30 +-
>  tests/Makefile.include          |   3 +
>  tests/test-rbcache.c            | 431 +++++++++++++++++++++++
>  util/Makefile.objs              |   2 +
>  util/rbcache.c                  | 253 +++++++++++++
>  util/rbtree.c                   | 570 ++++++++++++++++++++++++++++++
>  13 files changed, 2545 insertions(+), 2 deletions(-)
>  create mode 100644 block/pcache.c
>  create mode 100644 include/qemu/rbcache.h
>  create mode 100644 include/qemu/rbtree.h
>  create mode 100644 include/qemu/rbtree_augmented.h
>  create mode 100644 tests/test-rbcache.c
>  create mode 100644 util/rbcache.c
>  create mode 100644 util/rbtree.c
>
ping?

^ permalink raw reply	[flat|nested] 20+ messages in thread
