* [PATCH 0/7] qcow2: compressed write cache
From: Vladimir Sementsov-Ogievskiy @ 2021-01-29 16:50 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, armbru, qemu-devel, mreitz, den

Hi all!

I know I have several series waiting for a resend, but I had to switch
to another task spawned from a customer's bug.

The original problem: we use O_DIRECT for all VM images in our product,
as a matter of policy. The only exception is the qcow2 target image of a
compressed backup, because compressed backup is extremely slow with
O_DIRECT (due to unaligned writes). The customer complains that backup
produces a lot of page cache.

So we can either implement some internal cache or use fadvise somehow.
Backup has several async workers that write simultaneously, so either
way we have to track how host clusters fill up (before dropping the
page cache corresponding to a cluster). Since we have to do that
tracking anyway, let's try to implement the cache.

The idea is simple: cache small unaligned writes and flush each cluster
once it is filled.
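
In code terms the idea boils down to roughly the following (a simplified
sketch only, with made-up names; the real structures and logic are in
patch 2):

  /* One compressed write that has not reached the disk yet */
  typedef struct Extent {
      int64_t offset, bytes;
      void *buf;
  } Extent;

  /* All cached writes that land in one host cluster */
  typedef struct Cluster {
      int64_t cluster_offset;
      int64_t n_bytes;   /* sum of extent lengths                    */
      int64_t data_end;  /* no more writes expected past this offset */
      /* list of extents, sorted by offset */
  } Cluster;

  /*
   * On every compressed write: attach the extent to its cluster.  When
   * cluster_offset + n_bytes == data_end the cluster is complete, so it
   * is written out with one big aligned request (O_DIRECT-friendly) and
   * dropped from the cache.
   */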

The performance results are very good (the table shows the time, in
seconds, of a compressed backup of a 1000M disk filled with ones):

---------------  -----------  -----------
                 backup(old)  backup(new)
ssd:hdd(direct)  3e+02        4.4
                                -99%
ssd:hdd(cached)  5.7          5.4
                                -5%
---------------  -----------  -----------

So we benefit even in cached mode! And the fastest option is O_DIRECT
with the newly implemented cache. I therefore suggest enabling the new
cache by default (which this series does).

Command to generate performance comparison table:

 ./scripts/simplebench/bench-backup.py --compressed --target-cache both --dir ssd:/ssd --dir hdd:/work --test ssd:hdd --env old:/work/src/qemu/master/build/qemu-system-x86_64 new:/work/src/qemu/up/qcow2-compressed-write-cache/build/qemu-system-x86_6

Vladimir Sementsov-Ogievskiy (7):
  qemu/queue: add some useful QLIST_ and QTAILQ_ macros
  block/qcow2: introduce cache for compressed writes
  block/qcow2: use compressed write cache
  simplebench: bench_one(): add slow_limit argument
  simplebench: bench_one(): support count=1
  simplebench/bench-backup: add --compressed option
  simplebench/bench-backup: add target-cache argument

 qapi/block-core.json                   |   8 +-
 block/qcow2.h                          |  33 ++
 include/qemu/queue.h                   |  14 +
 block/qcow2-compressed-write-cache.c   | 770 +++++++++++++++++++++++++
 block/qcow2-refcount.c                 |  13 +
 block/qcow2.c                          |  87 ++-
 block/meson.build                      |   1 +
 scripts/simplebench/bench-backup.py    |  74 ++-
 scripts/simplebench/bench_block_job.py |  33 +-
 scripts/simplebench/simplebench.py     |  34 +-
 10 files changed, 1039 insertions(+), 28 deletions(-)
 create mode 100644 block/qcow2-compressed-write-cache.c

-- 
2.29.2




* [PATCH 1/7] qemu/queue: add some useful QLIST_ and QTAILQ_ macros
From: Vladimir Sementsov-Ogievskiy @ 2021-01-29 16:50 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, armbru, qemu-devel, mreitz, den

Add QLIST_FOREACH_FUNC_SAFE(), QTAILQ_FOREACH_FUNC_SAFE() and
QTAILQ_POP_HEAD(), to be used in the following commit.
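
For illustration, the _FUNC_SAFE macros are meant to be used like this
(hypothetical Item type; the real users appear in the next patch):

  typedef struct Item {
      QLIST_ENTRY(Item) next;
  } Item;

  QLIST_HEAD(, Item) items = QLIST_HEAD_INITIALIZER(items);

  static void item_free(Item *it)
  {
      g_free(it);
  }

  /*
   * Calls item_free() on every element; safe because the _SAFE loop
   * caches the next pointer before the callback runs.
   */
  QLIST_FOREACH_FUNC_SAFE(&items, next, item_free);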

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/qemu/queue.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index e029e7bf66..03e1fce85f 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -173,6 +173,13 @@ struct {                                                                \
                 (var) && ((next_var) = ((var)->field.le_next), 1);      \
                 (var) = (next_var))
 
+#define QLIST_FOREACH_FUNC_SAFE(head, field, func) do {                 \
+    typeof(*QLIST_FIRST(head)) *qffs_var, *qffs_next_var;               \
+    QLIST_FOREACH_SAFE(qffs_var, (head), field, qffs_next_var) {        \
+        (func)(qffs_var);                                               \
+    }                                                                   \
+} while (/*CONSTCOND*/0)
+
 /*
  * List access methods.
  */
@@ -490,6 +497,13 @@ union {                                                                 \
              (var) && ((prev_var) = QTAILQ_PREV(var, field), 1);        \
              (var) = (prev_var))
 
+#define QTAILQ_FOREACH_FUNC_SAFE(head, field, func) do {                \
+    typeof(*QTAILQ_FIRST(head)) *qffs_var, *qffs_next_var;              \
+    QTAILQ_FOREACH_SAFE(qffs_var, (head), field, qffs_next_var) {       \
+        (func)(qffs_var);                                               \
+    }                                                                   \
+} while (/*CONSTCOND*/0)
+
 /*
  * Tail queue access methods.
  */
-- 
2.29.2




* [PATCH 2/7] block/qcow2: introduce cache for compressed writes
From: Vladimir Sementsov-Ogievskiy @ 2021-01-29 16:50 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, armbru, qemu-devel, mreitz, den

Compressed writes and O_DIRECT are not friends: they work very slowly,
because a compressed write turns into many small writes that are not
aligned to 512 bytes.

Let's introduce an internal cache, so that compressed writes can perform
well when O_DIRECT is on.
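
The intended usage from the qcow2 driver looks roughly like this (a
sketch based on the declarations added to qcow2.h below; the real call
sites are wired up in the next patch):

  /* Create once; the cache size is in bytes and must be >= cluster_size */
  s->compressed_cache = qcow2_compressed_cache_new(s->data_file,
                                                   s->cluster_size,
                                                   64 * s->cluster_size);

  /* Compressed write path: the cache takes ownership of @buf */
  ret = qcow2_compressed_cache_co_write(s->compressed_cache,
                                        cluster_offset, out_len, out_buf);

  /* Reads of compressed clusters also go through the cache */
  ret = qcow2_compressed_cache_co_read(s->compressed_cache,
                                       coffset, csize, buf);

  /* Flush all cached data to the data file... */
  ret = qcow2_compressed_cache_co_flush(s->compressed_cache);

  /* ...and eventually tear the cache down */
  qcow2_compressed_cache_free(s->compressed_cache);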

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.h                        |  29 +
 block/qcow2-compressed-write-cache.c | 770 +++++++++++++++++++++++++++
 block/meson.build                    |   1 +
 3 files changed, 800 insertions(+)
 create mode 100644 block/qcow2-compressed-write-cache.c

diff --git a/block/qcow2.h b/block/qcow2.h
index 0678073b74..fbdedf89fa 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -322,6 +322,8 @@ typedef struct Qcow2BitmapHeaderExt {
     uint64_t bitmap_directory_offset;
 } QEMU_PACKED Qcow2BitmapHeaderExt;
 
+typedef struct Qcow2CompressedWriteCache Qcow2CompressedWriteCache;
+
 #define QCOW2_MAX_THREADS 4
 
 typedef struct BDRVQcow2State {
@@ -1010,4 +1012,31 @@ int coroutine_fn
 qcow2_co_decrypt(BlockDriverState *bs, uint64_t host_offset,
                  uint64_t guest_offset, void *buf, size_t len);
 
+Qcow2CompressedWriteCache *qcow2_compressed_cache_new(BdrvChild *data_file,
+                                                      int64_t cluster_size,
+                                                      int64_t cache_size);
+void qcow2_compressed_cache_free(Qcow2CompressedWriteCache *s);
+int coroutine_fn
+qcow2_compressed_cache_co_read(Qcow2CompressedWriteCache *s, int64_t offset,
+                               int64_t bytes, void *buf);
+int coroutine_fn
+qcow2_compressed_cache_co_write(Qcow2CompressedWriteCache *s, int64_t offset,
+                                int64_t bytes, void *buf);
+void coroutine_fn
+qcow2_compressed_cache_co_set_cluster_end(Qcow2CompressedWriteCache *s,
+                                          int64_t cluster_data_end);
+int coroutine_fn
+qcow2_compressed_cache_co_flush(Qcow2CompressedWriteCache *s);
+int qcow2_compressed_cache_flush(BlockDriverState *bs,
+                                 Qcow2CompressedWriteCache *state);
+int coroutine_fn
+qcow2_compressed_cache_co_stop_flush(Qcow2CompressedWriteCache *s);
+int qcow2_compressed_cache_stop_flush(BlockDriverState *bs,
+                                      Qcow2CompressedWriteCache *s);
+void qcow2_compressed_cache_set_size(Qcow2CompressedWriteCache *s,
+                                     int64_t size);
+void coroutine_fn
+qcow2_compressed_cache_co_discard(Qcow2CompressedWriteCache *s,
+                                  int64_t cluster_offset);
+
 #endif
diff --git a/block/qcow2-compressed-write-cache.c b/block/qcow2-compressed-write-cache.c
new file mode 100644
index 0000000000..7bb92cb550
--- /dev/null
+++ b/block/qcow2-compressed-write-cache.c
@@ -0,0 +1,770 @@
+/*
+ * Write cache for qcow2 compressed writes
+ *
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "qemu/osdep.h"
+
+#include "block/block_int.h"
+#include "block/block-gen.h"
+#include "qemu/coroutine.h"
+#include "qapi/qapi-events-block-core.h"
+#include "qcow2.h"
+
+typedef struct CacheExtent {
+    int64_t offset;
+    int64_t bytes;
+    void *buf;
+    QLIST_ENTRY(CacheExtent) next;
+} CacheExtent;
+
+typedef struct CacheCluster {
+    int64_t cluster_offset;
+    int64_t n_bytes; /* sum of extents lengths */
+
+    /*
+     * data_end: the cluster is full and ready to be flushed once data_end
+     * is reached. data_end is an absolute offset, not a relative one.
+     */
+    int64_t data_end;
+
+    bool in_flight; /* cluster is being flushed now */
+
+    /*
+     * waiters: coroutines to wake after flush. Must be empty when in_flight is
+     * false
+     */
+    CoQueue waiters;
+
+    QTAILQ_ENTRY(CacheCluster) next;
+    QLIST_HEAD(, CacheExtent) extents; /* sorted by offset */
+} CacheCluster;
+
+typedef QTAILQ_HEAD(, CacheCluster) CacheClusterQueue;
+
+struct Qcow2CompressedWriteCache {
+    /*
+     * @data_file and @cluster_size are copied from the qcow2 state. That is
+     * not much duplication, and it seems better than accessing the whole
+     * qcow2 state from here.
+     */
+    BdrvChild *data_file;
+    int64_t cluster_size;
+
+    CoQueue waiters; /* coroutines, waiting for free place in the cache */
+
+    /*
+     * The cache limits not the total number of clusters but the number of
+     * active clusters. Active clusters are clusters with non-zero @n_bytes
+     * (and therefore non-empty @extents). This is done so that we don't need
+     * to wait for a cache flush in qcow2_compressed_cache_co_set_cluster_end()
+     * (which may create a cluster with @data_end set but zero @n_bytes), as
+     * qcow2_compressed_cache_co_set_cluster_end() is intended to be called
+     * from the qcow2 mutex critical section.
+     */
+    int nb_active_clusters;
+
+    /*
+     * If max_active_clusters is 0, the cache is inactive: all new writes
+     * should fall through to data_file immediately.
+     */
+    int max_active_clusters;
+
+    CacheClusterQueue clusters;
+};
+
+/* New extent takes ownership of @buf */
+static CacheExtent *cache_extent_new(int64_t offset, int64_t bytes, void *buf)
+{
+    CacheExtent *e = g_new(CacheExtent, 1);
+
+    *e = (CacheExtent) {
+        .offset = offset,
+        .bytes = bytes,
+        .buf = buf,
+    };
+
+    return e;
+}
+
+static void cache_extent_free(CacheExtent *e)
+{
+    if (e) {
+        g_free(e->buf);
+        g_free(e);
+    }
+}
+
+static CacheCluster *find_cluster(Qcow2CompressedWriteCache *s,
+                                  int64_t cluster_offset)
+{
+    CacheCluster *c;
+
+    assert(QEMU_IS_ALIGNED(cluster_offset, s->cluster_size));
+
+    QTAILQ_FOREACH(c, &s->clusters, next) {
+        if (cluster_offset == c->cluster_offset) {
+            return c;
+        }
+    }
+
+    return NULL;
+}
+
+/* Creates "inactive" cluster, which doesn't influence s->nb_active_clusters */
+static CacheCluster *cache_cluster_new(Qcow2CompressedWriteCache *s,
+                                       int64_t cluster_offset)
+{
+    CacheCluster *c;
+
+    assert(!find_cluster(s, cluster_offset));
+
+    c = g_new(CacheCluster, 1);
+    *c = (CacheCluster) {
+        .cluster_offset = cluster_offset,
+        .data_end = cluster_offset + s->cluster_size
+    };
+
+    qemu_co_queue_init(&c->waiters);
+    QTAILQ_INSERT_TAIL(&s->clusters, c, next);
+
+    return c;
+}
+
+static void cache_cluster_free(CacheCluster *cluster)
+{
+    assert(!cluster->in_flight);
+    assert(qemu_co_queue_empty(&cluster->waiters));
+    QLIST_FOREACH_FUNC_SAFE(&cluster->extents, next, cache_extent_free);
+    g_free(cluster);
+}
+
+static bool cache_cluster_is_full(CacheCluster *cluster)
+{
+    assert(cluster->cluster_offset + cluster->n_bytes <= cluster->data_end);
+    return cluster->cluster_offset + cluster->n_bytes == cluster->data_end;
+}
+
+static void cache_cluster_remove(Qcow2CompressedWriteCache *s,
+                                 CacheCluster *cluster)
+{
+    if (cluster->n_bytes) {
+        s->nb_active_clusters--;
+    }
+    QTAILQ_REMOVE(&s->clusters, cluster, next);
+    cache_cluster_free(cluster);
+}
+
+/*
+ * Return the number of consecutive extents starting from @first. Store the
+ * pointer to the extent after the last one into @next, and store the end
+ * offset of the last extent into @end_off.
+ */
+static int count_consequitive_extents(CacheExtent *first,
+                                      CacheExtent **next,
+                                      int64_t *end_off)
+{
+    CacheExtent *e;
+    int64_t off = first->offset;
+    int nb = 0;
+
+    for (e = first; e; e = QLIST_NEXT(e, next)) {
+        assert(e->offset >= off);
+        if (e->offset > off) {
+            break;
+        }
+        off += e->bytes;
+        nb++;
+    }
+
+    if (next) {
+        *next = e;
+    }
+
+    if (end_off) {
+        *end_off = off;
+    }
+
+    return nb;
+}
+
+/*
+ * Write consecutive extents, starting from @first. Store the pointer to the
+ * extent after the last one into @next. If align > 1, pad the end of the
+ * whole write with zeroes up to that alignment.
+ */
+static int coroutine_fn flush_consequitive_extents(Qcow2CompressedWriteCache *s,
+                                                   CacheExtent *first,
+                                                   CacheExtent **next,
+                                                   int64_t align)
+{
+    CacheExtent *e, *end_extent;
+    int64_t end;
+    int nb_extents = count_consequitive_extents(first, &end_extent, &end);
+    int64_t aligned_end = QEMU_ALIGN_UP(end, align);
+    int64_t tail = aligned_end - end;
+    int64_t len = aligned_end - first->offset;
+
+    /*
+     * Alignment is only used when flushing a full cluster, in which case the
+     * first extent offset is always aligned.
+     */
+    assert(QEMU_IS_ALIGNED(first->offset, align));
+
+    if (next) {
+        *next = end_extent;
+    }
+
+    if (tail) {
+        nb_extents++;
+    }
+
+    if (nb_extents > IOV_MAX) {
+        g_autofree void *buf = g_malloc(len);
+        char *p = buf;
+
+        for (e = first; e != end_extent; e = QLIST_NEXT(e, next)) {
+            memcpy(p, e->buf, e->bytes);
+            p += e->bytes;
+        }
+
+        if (tail) {
+            memset(p, 0, tail);
+        }
+
+        BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
+        return bdrv_co_pwrite(s->data_file, first->offset, len, buf, 0);
+    } else {
+        int ret;
+        QEMUIOVector qiov;
+        g_autofree void *tail_buf = NULL;
+
+        qemu_iovec_init(&qiov, nb_extents);
+        for (e = first; e != end_extent; e = QLIST_NEXT(e, next)) {
+            qemu_iovec_add(&qiov, e->buf, e->bytes);
+        }
+
+        if (tail) {
+            tail_buf = g_malloc0(tail);
+            qemu_iovec_add(&qiov, tail_buf, tail);
+        }
+
+        BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
+        ret = bdrv_co_pwritev(s->data_file, first->offset, len, &qiov, 0);
+        qemu_iovec_destroy(&qiov);
+        return ret;
+    }
+}
+
+static int coroutine_fn cache_cluster_flush_full(Qcow2CompressedWriteCache *s,
+                                                 CacheCluster *cluster)
+{
+    int ret;
+    CacheExtent *end_extent;
+    int64_t align = MIN(s->cluster_size,
+                        MAX(s->data_file->bs->bl.request_alignment, 4 * 1024));
+
+    assert(cache_cluster_is_full(cluster));
+
+    ret = flush_consequitive_extents(s, QLIST_FIRST(&cluster->extents),
+                                     &end_extent, align);
+
+    assert(end_extent == NULL); /* all extents flushed */
+
+    return ret;
+}
+
+static int coroutine_fn cache_cluster_flush(Qcow2CompressedWriteCache *s,
+                                            CacheCluster *c)
+{
+    int ret;
+    CacheExtent *e = QLIST_FIRST(&c->extents);
+
+    if (cache_cluster_is_full(c)) {
+        return cache_cluster_flush_full(s, c);
+    }
+
+    while (e) {
+        ret = flush_consequitive_extents(s, e, &e, 1);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+int coroutine_fn qcow2_compressed_cache_co_flush(Qcow2CompressedWriteCache *s)
+{
+    int ret = 0;
+    CacheCluster *c;
+    GList *local_clusters = NULL, *p;
+
+    /*
+     * Take a snapshot of the current state: we will not flush clusters that
+     * are created in parallel with this flush operation, and we don't allow
+     * adding more extents to the staged clusters. This also protects us from
+     * parallel flush operations flushing the same clusters.
+     */
+    QTAILQ_FOREACH(c, &s->clusters, next) {
+        if (!c->in_flight && c->n_bytes) {
+            c->in_flight = true;
+            local_clusters = g_list_append(local_clusters, c);
+        }
+    }
+
+    for (p = local_clusters; p; p = p->next) {
+        CacheCluster *c = p->data;
+
+        if (ret == 0) {
+            ret = cache_cluster_flush(s, c);
+        }
+
+        c->in_flight = false;
+        qemu_co_queue_restart_all(&c->waiters);
+
+        if (ret == 0) {
+            cache_cluster_remove(s, c);
+        }
+    }
+
+    g_list_free(local_clusters);
+
+    return ret;
+}
+
+int coroutine_fn
+qcow2_compressed_cache_co_stop_flush(Qcow2CompressedWriteCache *s)
+{
+    int ret, save;
+
+    save = s->max_active_clusters;
+    s->max_active_clusters = 0; /* No more extents */
+
+    ret = qcow2_compressed_cache_co_flush(s);
+    if (ret < 0) {
+        s->max_active_clusters = save;
+        return ret;
+    }
+
+    assert(QTAILQ_EMPTY(&s->clusters));
+    return 0;
+}
+
+/* @cluster takes ownership of @extent */
+static void cluster_add_extent(Qcow2CompressedWriteCache *s,
+                               CacheCluster *cluster, CacheExtent *extent)
+{
+    CacheExtent *e;
+
+    assert(extent->bytes);
+    assert(extent->offset >= cluster->cluster_offset);
+    assert(extent->offset + extent->bytes <= cluster->data_end);
+    assert(!cluster->in_flight);
+
+    e = QLIST_FIRST(&cluster->extents);
+    if (!e) {
+        /* inactive cluster */
+        assert(!cluster->n_bytes);
+        s->nb_active_clusters++;
+        assert(s->nb_active_clusters <= s->max_active_clusters);
+        QLIST_INSERT_HEAD(&cluster->extents, extent, next);
+    } else if (e->offset > extent->offset) {
+        assert(extent->offset + extent->bytes <= e->offset);
+        QLIST_INSERT_HEAD(&cluster->extents, extent, next);
+    } else {
+        while (QLIST_NEXT(e, next) &&
+               QLIST_NEXT(e, next)->offset < extent->offset) {
+            e = QLIST_NEXT(e, next);
+        }
+
+        /* Now e is last element with offset < extent->offset */
+        assert(e->offset + e->bytes <= extent->offset);
+
+        QLIST_INSERT_AFTER(e, extent, next);
+
+        e = QLIST_NEXT(extent, next);
+        if (e) {
+            assert(extent->offset + extent->bytes <= e->offset);
+        }
+    }
+
+    cluster->n_bytes += extent->bytes;
+}
+
+static CacheCluster *find_cluster_to_flush(Qcow2CompressedWriteCache *s)
+{
+    CacheCluster *c;
+
+    QTAILQ_FOREACH(c, &s->clusters, next) {
+        if (!c->in_flight && cache_cluster_is_full(c)) {
+            return c;
+        }
+    }
+
+    return NULL;
+}
+
+/* Cache an extent if there is a place */
+static bool coroutine_fn
+try_cache_extent(Qcow2CompressedWriteCache *s, CacheExtent *extent,
+                 bool *cluster_in_flight)
+{
+    CacheCluster *c;
+    int64_t cluster_offset = QEMU_ALIGN_DOWN(extent->offset, s->cluster_size);
+
+    assert(extent->bytes);
+
+    if (s->max_active_clusters == 0) {
+        *cluster_in_flight = true;
+        return false;
+    }
+
+    *cluster_in_flight = false;
+
+    c = find_cluster(s, cluster_offset);
+    if (c && c->in_flight) {
+        *cluster_in_flight = true;
+        return false;
+    }
+    if (s->nb_active_clusters >= s->max_active_clusters &&
+        (!c || !c->n_bytes))
+    {
+        /*
+         * Cache is full, we can't allocate a new cluster and can't activate
+         * existing inactive cluster
+         */
+        return false;
+    }
+
+    if (!c) {
+        c = cache_cluster_new(s, cluster_offset);
+    }
+
+    cluster_add_extent(s, c, extent);
+
+    if (cache_cluster_is_full(c)) {
+        qemu_co_queue_restart_all(&s->waiters);
+    }
+
+    return true;
+}
+
+/* Takes ownership of @buf, don't free it after call! */
+int coroutine_fn
+qcow2_compressed_cache_co_write(Qcow2CompressedWriteCache *s, int64_t offset,
+                                int64_t bytes, void *buf)
+{
+    int ret;
+    int i;
+    CacheExtent *extents[] = {NULL, NULL};
+    int nb_extents = 0; /* number of non-NULL elements in @extents */
+    int64_t e0_len;
+
+    assert(bytes > 0);
+    assert(bytes < s->cluster_size);
+
+    e0_len = MIN(bytes, QEMU_ALIGN_UP(offset + 1, s->cluster_size) - offset);
+    extents[0] = cache_extent_new(offset, e0_len, buf);
+    nb_extents = 1;
+
+    if (bytes > e0_len) {
+        int64_t e1_len = bytes - e0_len;
+        /*
+         * We have to allocate a new buffer, so that clusters are independent
+         * and can free their extents when needed.
+         */
+        void *e1_buf = g_memdup(((const char *)buf) + e0_len, e1_len);
+
+        extents[1] = cache_extent_new(offset + e0_len, e1_len, e1_buf);
+        nb_extents = 2;
+    }
+
+    while (nb_extents) {
+        bool do_write = false;
+
+        for (i = 0; i < 2; i++) {
+            CacheExtent *e = extents[i];
+
+            do_write = false;
+
+            if (!e) {
+                continue;
+            }
+
+            if (try_cache_extent(s, e, &do_write)) {
+                extents[i] = NULL;
+                nb_extents--;
+                continue;
+            }
+
+            if (do_write) {
+                ret = bdrv_co_pwrite(s->data_file, e->offset, e->bytes,
+                                     e->buf, 0);
+
+                cache_extent_free(e);
+                extents[i] = NULL;
+                nb_extents--;
+
+                if (ret < 0) {
+                    goto out;
+                }
+            }
+        }
+
+        if (do_write) {
+            /*
+             * We yielded during second extent writing. Probably the cache is
+             * already free and we can now cache the first extent.
+             */
+            continue;
+        }
+
+        if (nb_extents) {
+            CacheCluster *cluster = find_cluster_to_flush(s);
+
+            if (cluster) {
+                cluster->in_flight = true;
+                ret = cache_cluster_flush_full(s, cluster);
+                cluster->in_flight = false;
+                qemu_co_queue_restart_all(&cluster->waiters);
+                qemu_co_queue_restart_all(&s->waiters);
+                if (ret < 0) {
+                    goto out;
+                }
+                cache_cluster_remove(s, cluster);
+                continue;
+            }
+
+            qemu_co_queue_wait(&s->waiters, NULL);
+        }
+    }
+
+    ret = 0;
+
+out:
+    for (i = 0; i < 2; i++) {
+        cache_extent_free(extents[i]);
+    }
+
+    return ret;
+}
+
+int coroutine_fn
+qcow2_compressed_cache_co_read(Qcow2CompressedWriteCache *s, int64_t offset,
+                               int64_t bytes, void *buf)
+{
+    CacheCluster *c;
+    CacheExtent *e;
+    int64_t cluster_offset = QEMU_ALIGN_DOWN(offset, s->cluster_size);
+
+    c = find_cluster(s, cluster_offset);
+    if (c) {
+        QLIST_FOREACH(e, &c->extents, next) {
+            if (e->offset == offset && e->bytes <= bytes) {
+                memcpy(buf, e->buf, e->bytes);
+                return 0;
+            }
+        }
+    }
+
+    return bdrv_co_pread(s->data_file, offset, bytes, buf, 0);
+}
+
+/*
+ * The caller states that there will be no writes to this cluster beyond the
+ * specified @cluster_data_end. So it is OK to flush the cluster once it is
+ * filled up to @cluster_data_end, and it is OK to pad the flushing write
+ * operation up to some alignment (not greater than cluster_size, of course).
+ */
+void coroutine_fn
+qcow2_compressed_cache_co_set_cluster_end(Qcow2CompressedWriteCache *s,
+                                          int64_t cluster_data_end)
+{
+    int64_t cluster_offset = QEMU_ALIGN_DOWN(cluster_data_end, s->cluster_size);
+    CacheExtent *e;
+    CacheCluster *c;
+
+    c = find_cluster(s, cluster_offset);
+    if (!c) {
+        c = cache_cluster_new(s, cluster_offset);
+    }
+
+    QLIST_FOREACH(e, &c->extents, next) {
+        assert(e->offset + e->bytes <= cluster_data_end);
+    }
+
+    /* Shouldn't set data_end several times */
+    assert(c->data_end == c->cluster_offset + s->cluster_size);
+
+    c->data_end = cluster_data_end;
+    if (cache_cluster_is_full(c)) {
+        qemu_co_queue_restart_all(&s->waiters);
+    }
+}
+
+Qcow2CompressedWriteCache *qcow2_compressed_cache_new(BdrvChild *data_file,
+                                                      int64_t cluster_size,
+                                                      int64_t cache_size)
+{
+    Qcow2CompressedWriteCache *s = g_new(Qcow2CompressedWriteCache, 1);
+
+    assert(cache_size >= cluster_size);
+
+    *s = (Qcow2CompressedWriteCache) {
+        .data_file = data_file,
+        .cluster_size = cluster_size,
+        .max_active_clusters = cache_size / cluster_size,
+    };
+
+    qemu_co_queue_init(&s->waiters);
+    QTAILQ_INIT(&s->clusters);
+
+    return s;
+}
+
+void qcow2_compressed_cache_free(Qcow2CompressedWriteCache *s)
+{
+    if (!s) {
+        return;
+    }
+
+    QTAILQ_FOREACH_FUNC_SAFE(&s->clusters, next, cache_cluster_free);
+    g_free(s);
+}
+
+void qcow2_compressed_cache_set_size(Qcow2CompressedWriteCache *s,
+                                     int64_t size)
+{
+    /*
+     * We don't flush here. That is fine: it is safe for the cache to
+     * temporarily exceed the new maximum, it will only shrink until it
+     * reaches it.
+     */
+    assert(size >= s->cluster_size);
+    s->max_active_clusters = size / s->cluster_size;
+}
+
+void coroutine_fn
+qcow2_compressed_cache_co_discard(Qcow2CompressedWriteCache *s,
+                                  int64_t cluster_offset)
+{
+    CacheCluster *c;
+
+    while (true) {
+        c = find_cluster(s, cluster_offset);
+        if (!c) {
+            return;
+        }
+        if (!c->in_flight) {
+            cache_cluster_remove(s, c);
+            return;
+        }
+        qemu_co_queue_wait(&c->waiters, NULL);
+    }
+}
+
+
+/*
+ * Wrappers for qcow2_compressed_cache_co_flush
+ *
+ * TODO: update scripts/block-coroutine-wrapper to generate this too
+ */
+
+typedef struct Qcow2CompressedCacheCoFlush {
+    BdrvPollCo poll_state;
+    Qcow2CompressedWriteCache *state;
+} Qcow2CompressedCacheCoFlush;
+
+static void coroutine_fn
+qcow2_compressed_cache_co_flush_entry(void *opaque)
+{
+    Qcow2CompressedCacheCoFlush *s = opaque;
+
+    s->poll_state.ret = qcow2_compressed_cache_co_flush(s->state);
+    s->poll_state.in_progress = false;
+
+    aio_wait_kick();
+}
+
+int qcow2_compressed_cache_flush(BlockDriverState *bs,
+                                 Qcow2CompressedWriteCache *state)
+{
+    if (qemu_in_coroutine()) {
+        return qcow2_compressed_cache_co_flush(state);
+    } else {
+        Qcow2CompressedCacheCoFlush s = {
+            .poll_state.bs = bs,
+            .poll_state.in_progress = true,
+
+            .state = state,
+        };
+
+        s.poll_state.co =
+            qemu_coroutine_create(qcow2_compressed_cache_co_flush_entry, &s);
+
+        return bdrv_poll_co(&s.poll_state);
+    }
+}
+
+/*
+ * Wrappers for qcow2_compressed_cache_co_stop_flush
+ *
+ * TODO: update scripts/block-coroutine-wrapper to generate this too
+ */
+
+typedef struct Qcow2CompressedCacheCoStopFlush {
+    BdrvPollCo poll_state;
+    Qcow2CompressedWriteCache *state;
+} Qcow2CompressedCacheCoStopFlush;
+
+static void coroutine_fn
+qcow2_compressed_cache_co_stop_flush_entry(void *opaque)
+{
+    Qcow2CompressedCacheCoStopFlush *s = opaque;
+
+    s->poll_state.ret = qcow2_compressed_cache_co_stop_flush(s->state);
+    s->poll_state.in_progress = false;
+
+    aio_wait_kick();
+}
+
+int qcow2_compressed_cache_stop_flush(BlockDriverState *bs,
+                                      Qcow2CompressedWriteCache *state)
+{
+    if (qemu_in_coroutine()) {
+        return qcow2_compressed_cache_co_stop_flush(state);
+    } else {
+        Qcow2CompressedCacheCoStopFlush s = {
+            .poll_state.bs = bs,
+            .poll_state.in_progress = true,
+
+            .state = state,
+        };
+
+        s.poll_state.co =
+            qemu_coroutine_create(qcow2_compressed_cache_co_stop_flush_entry,
+                                  &s);
+
+        return bdrv_poll_co(&s.poll_state);
+    }
+}
diff --git a/block/meson.build b/block/meson.build
index eeaefe5809..7b94794c28 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -25,6 +25,7 @@ block_ss.add(files(
   'qcow2-bitmap.c',
   'qcow2-cache.c',
   'qcow2-cluster.c',
+  'qcow2-compressed-write-cache.c',
   'qcow2-refcount.c',
   'qcow2-snapshot.c',
   'qcow2-threads.c',
-- 
2.29.2




* [PATCH 3/7] block/qcow2: use compressed write cache
From: Vladimir Sementsov-Ogievskiy @ 2021-01-29 16:50 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, armbru, qemu-devel, mreitz, den

Introduce a new option, compressed-cache-size, with a default of 64
clusters (so that it is not smaller than the backup job's default of
64 max-workers).
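
To illustrate how the two defaults relate (a sketch that assumes the
usual 64 KiB qcow2 cluster size; the names come from the previous patch):

  /* Default introduced here: 64 clusters worth of cache,
   * i.e. 4 MiB with 64 KiB clusters ... */
  s->compressed_cache_size = 64 * s->cluster_size;

  /* ... which qcow2_compressed_cache_new() turns into */
  max_active_clusters = cache_size / cluster_size;   /* == 64 */

  /*
   * So with the backup job's default of 64 max-workers, each in-flight
   * compressed write can be staged in its own active cluster without
   * forcing an early flush.
   */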

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 qapi/block-core.json   |  8 +++-
 block/qcow2.h          |  4 ++
 block/qcow2-refcount.c | 13 +++++++
 block/qcow2.c          | 87 ++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 108 insertions(+), 4 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 9f555d5c1d..e0be6657f3 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3202,6 +3202,11 @@
 #             an image, the data file name is loaded from the image
 #             file. (since 4.0)
 #
+# @compressed-cache-size: The maximum size of the compressed write cache
+#                         in bytes. If positive, it must not be less than
+#                         the cluster size. 0 disables the feature. The
+#                         default is 64 * cluster_size. (since 6.0)
+#
 # Since: 2.9
 ##
 { 'struct': 'BlockdevOptionsQcow2',
@@ -3217,7 +3222,8 @@
             '*refcount-cache-size': 'int',
             '*cache-clean-interval': 'int',
             '*encrypt': 'BlockdevQcow2Encryption',
-            '*data-file': 'BlockdevRef' } }
+            '*data-file': 'BlockdevRef',
+            '*compressed-cache-size': 'int' } }
 
 ##
 # @SshHostKeyCheckMode:
diff --git a/block/qcow2.h b/block/qcow2.h
index fbdedf89fa..b86aa06006 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -150,6 +150,7 @@
 #define QCOW2_OPT_L2_CACHE_ENTRY_SIZE "l2-cache-entry-size"
 #define QCOW2_OPT_REFCOUNT_CACHE_SIZE "refcount-cache-size"
 #define QCOW2_OPT_CACHE_CLEAN_INTERVAL "cache-clean-interval"
+#define QCOW2_OPT_COMPRESSED_CACHE_SIZE "compressed-cache-size"
 
 typedef struct QCowHeader {
     uint32_t magic;
@@ -422,6 +423,9 @@ typedef struct BDRVQcow2State {
      * is to convert the image with the desired compression type set.
      */
     Qcow2CompressionType compression_type;
+
+    uint64_t compressed_cache_size;
+    Qcow2CompressedWriteCache *compressed_cache;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 8e649b008e..5720591460 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -899,6 +899,11 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                 qcow2_cache_discard(s->l2_table_cache, table);
             }
 
+            if (s->compressed_cache) {
+                qcow2_compressed_cache_co_discard(s->compressed_cache,
+                                                  cluster_offset);
+            }
+
             if (s->discard_passthrough[type]) {
                 update_refcount_discard(bs, cluster_offset, s->cluster_size);
             }
@@ -1110,6 +1115,14 @@ int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size)
             }
 
             if (!offset || ROUND_UP(offset, s->cluster_size) != new_cluster) {
+                if (offset && s->compressed_cache) {
+                    /*
+                     * The previous cluster is unfinished, but we will not
+                     * write more data to it, so inform the compressed cache.
+                     */
+                    qcow2_compressed_cache_co_set_cluster_end(
+                            s->compressed_cache, offset);
+                }
                 offset = new_cluster;
                 free_in_cluster = s->cluster_size;
             } else {
diff --git a/block/qcow2.c b/block/qcow2.c
index 5d94f45be9..3997d8c143 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -808,6 +808,13 @@ static QemuOptsList qcow2_runtime_opts = {
             .type = QEMU_OPT_NUMBER,
             .help = "Clean unused cache entries after this time (in seconds)",
         },
+        {
+            .name = QCOW2_OPT_COMPRESSED_CACHE_SIZE,
+            .type = QEMU_OPT_NUMBER,
+            .help = "The maximum size of compressed write cache in bytes. If "
+                    "positive must be not less than cluster size. 0 disables "
+                    "the feature. Default is 64 * cluster_size",
+        },
         BLOCK_CRYPTO_OPT_DEF_KEY_SECRET("encrypt.",
             "ID of secret providing qcow2 AES key or LUKS passphrase"),
         { /* end of list */ }
@@ -969,6 +976,38 @@ typedef struct Qcow2ReopenState {
     QCryptoBlockOpenOptions *crypto_opts; /* Disk encryption runtime options */
 } Qcow2ReopenState;
 
+static int qcow2_update_compressed_cache_option(BlockDriverState *bs,
+                                                QemuOpts *opts, Error **errp)
+{
+    BDRVQcow2State *s = bs->opaque;
+    uint64_t new_size;
+    int ret;
+
+    new_size = qemu_opt_get_size(opts, QCOW2_OPT_COMPRESSED_CACHE_SIZE,
+                                 64 * s->cluster_size);
+    if (new_size > 0 && new_size < s->cluster_size) {
+        error_setg(errp, "compressed cache size is too small");
+        return -EINVAL;
+    }
+
+    if (s->compressed_cache && !new_size) {
+        ret = qcow2_compressed_cache_stop_flush(bs, s->compressed_cache);
+        if (ret < 0) {
+            error_report("Failed to flush the compressed write cache: %s",
+                         strerror(-ret));
+            return ret;
+        }
+        qcow2_compressed_cache_free(s->compressed_cache);
+        s->compressed_cache = NULL;
+    } else if (s->compressed_cache) {
+        qcow2_compressed_cache_set_size(s->compressed_cache, new_size);
+    }
+
+    s->compressed_cache_size = new_size;
+
+    return 0;
+}
+
 static int qcow2_update_options_prepare(BlockDriverState *bs,
                                         Qcow2ReopenState *r,
                                         QDict *options, int flags,
@@ -994,6 +1033,11 @@ static int qcow2_update_options_prepare(BlockDriverState *bs,
         goto fail;
     }
 
+    ret = qcow2_update_compressed_cache_option(bs, opts, errp);
+    if (ret < 0) {
+        goto fail;
+    }
+
     /* get L2 table/refcount block cache size from command line options */
     read_cache_sizes(bs, opts, &l2_cache_size, &l2_cache_entry_size,
                      &refcount_cache_size, &local_err);
@@ -2660,6 +2704,17 @@ static int qcow2_inactivate(BlockDriverState *bs)
                           bdrv_get_device_or_node_name(bs));
     }
 
+    if (s->compressed_cache) {
+        ret = qcow2_compressed_cache_stop_flush(bs, s->compressed_cache);
+        if (ret < 0) {
+            result = ret;
+            error_report("Failed to flush the compressed write cache: %s",
+                         strerror(-ret));
+        }
+        qcow2_compressed_cache_free(s->compressed_cache);
+        s->compressed_cache = NULL;
+    }
+
     ret = qcow2_cache_flush(bs, s->l2_table_cache);
     if (ret) {
         result = ret;
@@ -2692,6 +2747,8 @@ static void qcow2_close(BlockDriverState *bs)
         qcow2_inactivate(bs);
     }
 
+    assert(!s->compressed_cache);
+
     cache_clean_timer_del(bs);
     qcow2_cache_destroy(s->l2_table_cache);
     qcow2_cache_destroy(s->refcount_block_cache);
@@ -4539,8 +4596,14 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
         goto fail;
     }
 
-    BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
-    ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
+    if (s->compressed_cache) {
+        ret = qcow2_compressed_cache_co_write(s->compressed_cache,
+                                              cluster_offset, out_len, out_buf);
+        out_buf = NULL;
+    } else {
+        BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
+        ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
+    }
     if (ret < 0) {
         goto fail;
     }
@@ -4601,6 +4664,12 @@ qcow2_co_pwritev_compressed_part(BlockDriverState *bs,
         return -EINVAL;
     }
 
+    if (!s->compressed_cache && s->compressed_cache_size) {
+        s->compressed_cache =
+            qcow2_compressed_cache_new(s->data_file, s->cluster_size,
+                                       s->compressed_cache_size);
+    }
+
     while (bytes && aio_task_pool_status(aio) == 0) {
         uint64_t chunk_size = MIN(bytes, s->cluster_size);
 
@@ -4656,7 +4725,12 @@ qcow2_co_preadv_compressed(BlockDriverState *bs,
     out_buf = qemu_blockalign(bs, s->cluster_size);
 
     BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
-    ret = bdrv_co_pread(bs->file, coffset, csize, buf, 0);
+    if (s->compressed_cache) {
+        ret = qcow2_compressed_cache_co_read(s->compressed_cache,
+                                             coffset, csize, buf);
+    } else {
+        ret = bdrv_co_pread(bs->file, coffset, csize, buf, 0);
+    }
     if (ret < 0) {
         goto fail;
     }
@@ -4875,6 +4949,13 @@ static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
     BDRVQcow2State *s = bs->opaque;
     int ret;
 
+    if (s->compressed_cache) {
+        ret = qcow2_compressed_cache_flush(bs, s->compressed_cache);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
     qemu_co_mutex_lock(&s->lock);
     ret = qcow2_write_caches(bs);
     qemu_co_mutex_unlock(&s->lock);
-- 
2.29.2




* [PATCH 4/7] simplebench: bench_one(): add slow_limit argument
From: Vladimir Sementsov-Ogievskiy @ 2021-01-29 16:50 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, armbru, qemu-devel, mreitz, den

Sometimes one of the cells in a test table runs too slowly, and we really
don't want to wait that long. Limit the number of runs in that case.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 scripts/simplebench/simplebench.py | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/scripts/simplebench/simplebench.py b/scripts/simplebench/simplebench.py
index f61513af90..78d90faf4d 100644
--- a/scripts/simplebench/simplebench.py
+++ b/scripts/simplebench/simplebench.py
@@ -19,9 +19,11 @@
 #
 
 import statistics
+import time
 
 
-def bench_one(test_func, test_env, test_case, count=5, initial_run=True):
+def bench_one(test_func, test_env, test_case, count=5, initial_run=True,
+              slow_limit=100):
     """Benchmark one test-case
 
     test_func   -- benchmarking function with prototype
@@ -36,6 +38,8 @@ def bench_one(test_func, test_env, test_case, count=5, initial_run=True):
     test_case   -- test case - opaque second argument for test_func
     count       -- how many times to call test_func, to calculate average
     initial_run -- do initial run of test_func, which don't get into result
+    slow_limit  -- reduce the number of test runs (to at least two) if a
+                   single run exceeds this limit (in seconds)
 
     Returns dict with the following fields:
         'runs':     list of test_func results
@@ -47,17 +51,34 @@ def bench_one(test_func, test_env, test_case, count=5, initial_run=True):
         'n-failed': number of failed runs (exists only if at least one run
                     failed)
     """
+    runs = []
+    i = 0
     if initial_run:
+        t = time.time()
+
         print('  #initial run:')
-        print('   ', test_func(test_env, test_case))
+        res = test_func(test_env, test_case)
+        print('   ', res)
+
+        if time.time() - t > slow_limit:
+            print('    - initial run is too slow, so it counts')
+            runs.append(res)
+            i = 1
+
+    for i in range(i, count):
+        t = time.time()
 
-    runs = []
-    for i in range(count):
         print('  #run {}'.format(i+1))
         res = test_func(test_env, test_case)
         print('   ', res)
         runs.append(res)
 
+        if time.time() - t > slow_limit and len(runs) >= 2:
+            print('    - run is too slow, and we have enough runs, stop here')
+            break
+
+    count = len(runs)
+
     result = {'runs': runs}
 
     succeeded = [r for r in runs if ('seconds' in r or 'iops' in r)]
-- 
2.29.2




* [PATCH 5/7] simplebench: bench_one(): support count=1
From: Vladimir Sementsov-Ogievskiy @ 2021-01-29 16:50 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, armbru, qemu-devel, mreitz, den

statistics.stdev raises an exception if the sequence has fewer than two
elements. Handle that case by hand.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 scripts/simplebench/simplebench.py | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/scripts/simplebench/simplebench.py b/scripts/simplebench/simplebench.py
index 78d90faf4d..407e95cec3 100644
--- a/scripts/simplebench/simplebench.py
+++ b/scripts/simplebench/simplebench.py
@@ -92,7 +92,10 @@ def bench_one(test_func, test_env, test_case, count=5, initial_run=True,
             dim = 'seconds'
         result['dimension'] = dim
         result['average'] = statistics.mean(r[dim] for r in succeeded)
-        result['stdev'] = statistics.stdev(r[dim] for r in succeeded)
+        if len(succeeded) == 1:
+            result['stdev'] = 0
+        else:
+            result['stdev'] = statistics.stdev(r[dim] for r in succeeded)
 
     if len(succeeded) < count:
         result['n-failed'] = count - len(succeeded)
-- 
2.29.2




* [PATCH 6/7] simplebench/bench-backup: add --compressed option
From: Vladimir Sementsov-Ogievskiy @ 2021-01-29 16:50 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, armbru, qemu-devel, mreitz, den

Allow benchmarking compressed backup.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 scripts/simplebench/bench-backup.py    | 55 ++++++++++++++++++--------
 scripts/simplebench/bench_block_job.py | 23 +++++++++++
 2 files changed, 62 insertions(+), 16 deletions(-)

diff --git a/scripts/simplebench/bench-backup.py b/scripts/simplebench/bench-backup.py
index 33a1ecfefa..72eae85bb1 100755
--- a/scripts/simplebench/bench-backup.py
+++ b/scripts/simplebench/bench-backup.py
@@ -23,7 +23,7 @@ import json
 
 import simplebench
 from results_to_text import results_to_text
-from bench_block_job import bench_block_copy, drv_file, drv_nbd
+from bench_block_job import bench_block_copy, drv_file, drv_nbd, drv_qcow2
 
 
 def bench_func(env, case):
@@ -37,29 +37,41 @@ def bench_func(env, case):
 def bench(args):
     test_cases = []
 
-    sources = {}
-    targets = {}
-    for d in args.dir:
-        label, path = d.split(':')  # paths with colon not supported
-        sources[label] = drv_file(path + '/test-source')
-        targets[label] = drv_file(path + '/test-target')
+    # paths with colon not supported, so we just split by ':'
+    dirs = dict(d.split(':') for d in args.dir)
 
+    nbd_drv = None
     if args.nbd:
         nbd = args.nbd.split(':')
         host = nbd[0]
         port = '10809' if len(nbd) == 1 else nbd[1]
-        drv = drv_nbd(host, port)
-        sources['nbd'] = drv
-        targets['nbd'] = drv
+        nbd_drv = drv_nbd(host, port)
 
     for t in args.test:
         src, dst = t.split(':')
 
-        test_cases.append({
-            'id': t,
-            'source': sources[src],
-            'target': targets[dst]
-        })
+        if src == 'nbd' and dst == 'nbd':
+            raise ValueError("Can't use 'nbd' label for both src and dst")
+
+        if (src == 'nbd' or dst == 'nbd') and not nbd_drv:
+            raise ValueError("'nbd' label used but --nbd is not given")
+
+        if src == 'nbd':
+            source = nbd_drv
+        else:
+            source = drv_file(dirs[src] + '/test-source')
+
+        if dst == 'nbd':
+            test_cases.append({'id': t, 'source': source, 'target': nbd_drv})
+            continue
+
+        fname = dirs[dst] + '/test-target'
+        if args.compressed:
+            fname += '.qcow2'
+        target = drv_file(fname)
+        if args.compressed:
+            target = drv_qcow2(target)
+        test_cases.append({'id': t, 'source': source, 'target': target})
 
     binaries = []  # list of (<label>, <path>, [<options>])
     for i, q in enumerate(args.env):
@@ -106,6 +118,13 @@ def bench(args):
             elif opt.startswith('max-workers='):
                 x_perf['max-workers'] = int(opt.split('=')[1])
 
+        backup_options = {}
+        if x_perf:
+            backup_options['x-perf'] = x_perf
+
+        if args.compressed:
+            backup_options['compress'] = True
+
         if is_mirror:
             assert not x_perf
             test_envs.append({
@@ -117,7 +136,7 @@ def bench(args):
             test_envs.append({
                 'id': f'backup({label})\n' + '\n'.join(opts),
                 'cmd': 'blockdev-backup',
-                'cmd-options': {'x-perf': x_perf} if x_perf else {},
+                'cmd-options': backup_options,
                 'qemu-binary': path
             })
 
@@ -163,5 +182,9 @@ default port 10809). Use it in tests, label is "nbd"
     p.add_argument('--test', nargs='+', help='''\
 Tests, in form source-dir-label:target-dir-label''',
                    action=ExtendAction)
+    p.add_argument('--compressed', help='''\
+Use compressed backup. It automatically means
+creating a qcow2 target with lazy_refcounts
+for each test run''', action='store_true')
 
     bench(p.parse_args())
diff --git a/scripts/simplebench/bench_block_job.py b/scripts/simplebench/bench_block_job.py
index 7332845c1c..08f86ed9c1 100755
--- a/scripts/simplebench/bench_block_job.py
+++ b/scripts/simplebench/bench_block_job.py
@@ -21,6 +21,7 @@
 
 import sys
 import os
+import subprocess
 import socket
 import json
 
@@ -77,11 +78,29 @@ def bench_block_job(cmd, cmd_args, qemu_args):
     return {'seconds': (end_ms - start_ms) / 1000000.0}
 
 
+def get_image_size(path):
+    out = subprocess.run(['qemu-img', 'info', '--out=json', path],
+                         stdout=subprocess.PIPE, check=True).stdout
+    return json.loads(out)['virtual-size']
+
+
 # Bench backup or mirror
 def bench_block_copy(qemu_binary, cmd, cmd_options, source, target):
     """Helper to run bench_block_job() for mirror or backup"""
     assert cmd in ('blockdev-backup', 'blockdev-mirror')
 
+    if target['driver'] == 'qcow2':
+        try:
+            os.remove(target['file']['filename'])
+        except OSError:
+            pass
+
+        subprocess.run(['qemu-img', 'create', '-f', 'qcow2',
+                        target['file']['filename'],
+                        str(get_image_size(source['filename']))],
+                       stdout=subprocess.DEVNULL,
+                       stderr=subprocess.DEVNULL, check=True)
+
     source['node-name'] = 'source'
     target['node-name'] = 'target'
 
@@ -106,6 +125,10 @@ def drv_nbd(host, port):
             'server': {'type': 'inet', 'host': host, 'port': port}}
 
 
+def drv_qcow2(file):
+    return {'driver': 'qcow2', 'file': file}
+
+
 if __name__ == '__main__':
     import sys
 
-- 
2.29.2




* [PATCH 7/7] simplebench/bench-backup: add target-cache argument
From: Vladimir Sementsov-Ogievskiy @ 2021-01-29 16:50 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, vsementsov, armbru, qemu-devel, mreitz, den

Allow benchmarking with different kinds of target cache.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 scripts/simplebench/bench-backup.py    | 33 ++++++++++++++++++++------
 scripts/simplebench/bench_block_job.py | 10 +++++---
 2 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/scripts/simplebench/bench-backup.py b/scripts/simplebench/bench-backup.py
index 72eae85bb1..fbc85f266f 100755
--- a/scripts/simplebench/bench-backup.py
+++ b/scripts/simplebench/bench-backup.py
@@ -65,13 +65,26 @@ def bench(args):
             test_cases.append({'id': t, 'source': source, 'target': nbd_drv})
             continue
 
-        fname = dirs[dst] + '/test-target'
-        if args.compressed:
-            fname += '.qcow2'
-        target = drv_file(fname)
-        if args.compressed:
-            target = drv_qcow2(target)
-        test_cases.append({'id': t, 'source': source, 'target': target})
+        if args.target_cache == 'both':
+            target_caches = ['direct', 'cached']
+        else:
+            target_caches = [args.target_cache]
+
+        for c in target_caches:
+            o_direct = c == 'direct'
+            fname = dirs[dst] + '/test-target'
+            if args.compressed:
+                fname += '.qcow2'
+            target = drv_file(fname, o_direct=o_direct)
+            if args.compressed:
+                target = drv_qcow2(target)
+
+            test_id = t
+            if args.target_cache == 'both':
+                test_id += f'({c})'
+
+            test_cases.append({'id': test_id, 'source': source,
+                               'target': target})
 
     binaries = []  # list of (<label>, <path>, [<options>])
     for i, q in enumerate(args.env):
@@ -186,5 +199,11 @@ Tests, in form source-dir-label:target-dir-label''',
 Use compressed backup. It automatically means
 creating a qcow2 target with lazy_refcounts
 for each test run''', action='store_true')
+    p.add_argument('--target-cache', help='''\
+Set up caching for the target nodes. Options:
+   direct: default, use O_DIRECT and aio=native
+   cached: use the system cache (QEMU default) and aio=threads (QEMU default)
+   both: generate two test cases for each src:dst pair''',
+                   default='direct', choices=('direct', 'cached', 'both'))
 
     bench(p.parse_args())
diff --git a/scripts/simplebench/bench_block_job.py b/scripts/simplebench/bench_block_job.py
index 08f86ed9c1..8f8385ccce 100755
--- a/scripts/simplebench/bench_block_job.py
+++ b/scripts/simplebench/bench_block_job.py
@@ -115,9 +115,13 @@ def bench_block_copy(qemu_binary, cmd, cmd_options, source, target):
                             '-blockdev', json.dumps(target)])
 
 
-def drv_file(filename):
-    return {'driver': 'file', 'filename': filename,
-            'cache': {'direct': True}, 'aio': 'native'}
+def drv_file(filename, o_direct=True):
+    node = {'driver': 'file', 'filename': filename}
+    if o_direct:
+        node['cache'] = {'direct': True}
+        node['aio'] = 'native'
+
+    return node
 
 
 def drv_nbd(host, port):
-- 
2.29.2




* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-01-29 16:50 [PATCH 0/7] qcow2: compressed write cache Vladimir Sementsov-Ogievskiy
                   ` (6 preceding siblings ...)
  2021-01-29 16:50 ` [PATCH 7/7] simplebench/bench-backup: add target-cache argument Vladimir Sementsov-Ogievskiy
@ 2021-01-29 17:30 ` no-reply
  2021-02-01  8:24   ` Vladimir Sementsov-Ogievskiy
  2021-02-09 13:25 ` Max Reitz
  2021-02-10 12:35 ` Kevin Wolf
  9 siblings, 1 reply; 32+ messages in thread
From: no-reply @ 2021-01-29 17:30 UTC (permalink / raw)
  To: vsementsov; +Cc: kwolf, vsementsov, qemu-block, qemu-devel, armbru, mreitz, den

Patchew URL: https://patchew.org/QEMU/20210129165030.640169-1-vsementsov@virtuozzo.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20210129165030.640169-1-vsementsov@virtuozzo.com
Subject: [PATCH 0/7] qcow2: compressed write cache

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
   5101d00..3701c07  master     -> master
 - [tag update]      patchew/20210129110012.8660-1-peter.maydell@linaro.org -> patchew/20210129110012.8660-1-peter.maydell@linaro.org
 * [new tag]         patchew/20210129165030.640169-1-vsementsov@virtuozzo.com -> patchew/20210129165030.640169-1-vsementsov@virtuozzo.com
Switched to a new branch 'test'
d7783a4 simplebench/bench-backup: add target-cache argument
ddf4442 simplebench/bench-backup: add --compressed option
47ee627 simplebench: bench_one(): support count=1
2b80e33 simplebench: bench_one(): add slow_limit argument
acf2fb6 block/qcow2: use compressed write cache
d96e35f block/qcow2: introduce cache for compressed writes
0d009e1 qemu/queue: add some useful QLIST_ and QTAILQ_ macros

=== OUTPUT BEGIN ===
1/7 Checking commit 0d009e16280d (qemu/queue: add some useful QLIST_ and QTAILQ_ macros)
ERROR: spaces required around that '*' (ctx:WxV)
#25: FILE: include/qemu/queue.h:177:
+    typeof(*QLIST_FIRST(head)) *qffs_var, *qffs_next_var;               \
                                ^

WARNING: Block comments use a leading /* on a separate line
#29: FILE: include/qemu/queue.h:181:
+} while (/*CONSTCOND*/0)

ERROR: spaces required around that '*' (ctx:WxV)
#39: FILE: include/qemu/queue.h:501:
+    typeof(*QTAILQ_FIRST(head)) *qffs_var, *qffs_next_var;              \
                                 ^

WARNING: Block comments use a leading /* on a separate line
#43: FILE: include/qemu/queue.h:505:
+} while (/*CONSTCOND*/0)

total: 2 errors, 2 warnings, 26 lines checked

Patch 1/7 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

2/7 Checking commit d96e35f1544f (block/qcow2: introduce cache for compressed writes)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#34: 
new file mode 100644

total: 0 errors, 1 warnings, 816 lines checked

Patch 2/7 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
3/7 Checking commit acf2fb60d5c2 (block/qcow2: use compressed write cache)
4/7 Checking commit 2b80e334b300 (simplebench: bench_one(): add slow_limit argument)
5/7 Checking commit 47ee62768317 (simplebench: bench_one(): support count=1)
6/7 Checking commit ddf44420a9e8 (simplebench/bench-backup: add --compressed option)
7/7 Checking commit d7783a45ee7b (simplebench/bench-backup: add target-cache argument)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20210129165030.640169-1-vsementsov@virtuozzo.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-01-29 17:30 ` [PATCH 0/7] qcow2: compressed write cache no-reply
@ 2021-02-01  8:24   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-01  8:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, den, armbru, qemu-block, mreitz

29.01.2021 20:30, no-reply@patchew.org wrote:
> Patchew URL: https://patchew.org/QEMU/20210129165030.640169-1-vsementsov@virtuozzo.com/
> 
> 
> 
> Hi,
> 
> This series seems to have some coding style problems. See output below for
> more information:
> 
> Type: series
> Message-id: 20210129165030.640169-1-vsementsov@virtuozzo.com
> Subject: [PATCH 0/7] qcow2: compressed write cache
> 
> === TEST SCRIPT BEGIN ===
> #!/bin/bash
> git rev-parse base > /dev/null || exit 0
> git config --local diff.renamelimit 0
> git config --local diff.renames True
> git config --local diff.algorithm histogram
> ./scripts/checkpatch.pl --mailback base..
> === TEST SCRIPT END ===
> 
> Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
>  From https://github.com/patchew-project/qemu
>     5101d00..3701c07  master     -> master
>   - [tag update]      patchew/20210129110012.8660-1-peter.maydell@linaro.org -> patchew/20210129110012.8660-1-peter.maydell@linaro.org
>   * [new tag]         patchew/20210129165030.640169-1-vsementsov@virtuozzo.com -> patchew/20210129165030.640169-1-vsementsov@virtuozzo.com
> Switched to a new branch 'test'
> d7783a4 simplebench/bench-backup: add target-cache argument
> ddf4442 simplebench/bench-backup: add --compressed option
> 47ee627 simplebench: bench_one(): support count=1
> 2b80e33 simplebench: bench_one(): add slow_limit argument
> acf2fb6 block/qcow2: use compressed write cache
> d96e35f block/qcow2: introduce cache for compressed writes
> 0d009e1 qemu/queue: add some useful QLIST_ and QTAILQ_ macros
> 
> === OUTPUT BEGIN ===
> 1/7 Checking commit 0d009e16280d (qemu/queue: add some useful QLIST_ and QTAILQ_ macros)
> ERROR: spaces required around that '*' (ctx:WxV)
> #25: FILE: include/qemu/queue.h:177:
> +    typeof(*QLIST_FIRST(head)) *qffs_var, *qffs_next_var;               \
>                                  ^
> 
> WARNING: Block comments use a leading /* on a separate line
> #29: FILE: include/qemu/queue.h:181:
> +} while (/*CONSTCOND*/0)
> 
> ERROR: spaces required around that '*' (ctx:WxV)
> #39: FILE: include/qemu/queue.h:501:
> +    typeof(*QTAILQ_FIRST(head)) *qffs_var, *qffs_next_var;              \
>                                   ^
> 
> WARNING: Block comments use a leading /* on a separate line
> #43: FILE: include/qemu/queue.h:505:
> +} while (/*CONSTCOND*/0)
> 


all false positives


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/7] qemu/queue: add some useful QLIST_ and QTAILQ_ macros
  2021-01-29 16:50 ` [PATCH 1/7] qemu/queue: add some useful QLIST_ and QTAILQ_ macros Vladimir Sementsov-Ogievskiy
@ 2021-02-01  8:29   ` Markus Armbruster
  2021-02-01  8:34     ` Vladimir Sementsov-Ogievskiy
  2021-02-10 17:07   ` Max Reitz
  1 sibling, 1 reply; 32+ messages in thread
From: Markus Armbruster @ 2021-02-01  8:29 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy; +Cc: kwolf, den, qemu-devel, qemu-block, mreitz

Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> writes:

> Add QLIST_FOREACH_FUNC_SAFE(), QTAILQ_FOREACH_FUNC_SAFE() and
> QTAILQ_POP_HEAD(), to be used in following commit.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  include/qemu/queue.h | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/include/qemu/queue.h b/include/qemu/queue.h
> index e029e7bf66..03e1fce85f 100644
> --- a/include/qemu/queue.h
> +++ b/include/qemu/queue.h
> @@ -173,6 +173,13 @@ struct {                                                                \
>                  (var) && ((next_var) = ((var)->field.le_next), 1);      \
>                  (var) = (next_var))
>  
> +#define QLIST_FOREACH_FUNC_SAFE(head, field, func) do {                 \
> +    typeof(*QLIST_FIRST(head)) *qffs_var, *qffs_next_var;               \
> +    QLIST_FOREACH_SAFE(qffs_var, (head), field, qffs_next_var) {        \
> +        (func)(qffs_var);                                               \
> +    }                                                                   \
> +} while (/*CONSTCOND*/0)
> +
>  /*
>   * List access methods.
>   */
> @@ -490,6 +497,13 @@ union {                                                                 \
>               (var) && ((prev_var) = QTAILQ_PREV(var, field), 1);        \
>               (var) = (prev_var))
>  
> +#define QTAILQ_FOREACH_FUNC_SAFE(head, field, func) do {                \
> +    typeof(*QTAILQ_FIRST(head)) *qffs_var, *qffs_next_var;              \
> +    QTAILQ_FOREACH_SAFE(qffs_var, (head), field, qffs_next_var) {       \
> +        (func)(qffs_var);                                               \
> +    }                                                                   \
> +} while (/*CONSTCOND*/0)
> +
>  /*
>   * Tail queue access methods.
>   */

I wonder whether these are worth having.  Can you show the difference
they make in the next patch?



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/7] qemu/queue: add some useful QLIST_ and QTAILQ_ macros
  2021-02-01  8:29   ` Markus Armbruster
@ 2021-02-01  8:34     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-01  8:34 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: kwolf, den, qemu-devel, qemu-block, mreitz

01.02.2021 11:29, Markus Armbruster wrote:
> Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> writes:
> 
>> Add QLIST_FOREACH_FUNC_SAFE(), QTAILQ_FOREACH_FUNC_SAFE() and
>> QTAILQ_POP_HEAD(), to be used in following commit.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   include/qemu/queue.h | 14 ++++++++++++++
>>   1 file changed, 14 insertions(+)
>>
>> diff --git a/include/qemu/queue.h b/include/qemu/queue.h
>> index e029e7bf66..03e1fce85f 100644
>> --- a/include/qemu/queue.h
>> +++ b/include/qemu/queue.h
>> @@ -173,6 +173,13 @@ struct {                                                                \
>>                   (var) && ((next_var) = ((var)->field.le_next), 1);      \
>>                   (var) = (next_var))
>>   
>> +#define QLIST_FOREACH_FUNC_SAFE(head, field, func) do {                 \
>> +    typeof(*QLIST_FIRST(head)) *qffs_var, *qffs_next_var;               \
>> +    QLIST_FOREACH_SAFE(qffs_var, (head), field, qffs_next_var) {        \
>> +        (func)(qffs_var);                                               \
>> +    }                                                                   \
>> +} while (/*CONSTCOND*/0)
>> +
>>   /*
>>    * List access methods.
>>    */
>> @@ -490,6 +497,13 @@ union {                                                                 \
>>                (var) && ((prev_var) = QTAILQ_PREV(var, field), 1);        \
>>                (var) = (prev_var))
>>   
>> +#define QTAILQ_FOREACH_FUNC_SAFE(head, field, func) do {                \
>> +    typeof(*QTAILQ_FIRST(head)) *qffs_var, *qffs_next_var;              \
>> +    QTAILQ_FOREACH_SAFE(qffs_var, (head), field, qffs_next_var) {       \
>> +        (func)(qffs_var);                                               \
>> +    }                                                                   \
>> +} while (/*CONSTCOND*/0)
>> +
>>   /*
>>    * Tail queue access methods.
>>    */
> 
> I wonder whether these are worth having.  Can you show the difference
> they make in the next patch?
> 

Not a big difference, so I can easily drop them. But I think it's a good idea and they can be reused. Why don't you like it?
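
For illustration, the difference is roughly the following (Entry, the list
head and entry_free() are made-up names, not code from the series):

#include <stdlib.h>
#include "qemu/queue.h"

typedef struct Entry {
    int value;
    QLIST_ENTRY(Entry) next;
} Entry;

static QLIST_HEAD(, Entry) entries = QLIST_HEAD_INITIALIZER(entries);

static void entry_free(Entry *e)
{
    QLIST_REMOVE(e, next);
    free(e);
}

/* Without the new macro: open-coded safe iteration. */
static void drain_entries_open_coded(void)
{
    Entry *e, *next_e;

    QLIST_FOREACH_SAFE(e, &entries, next, next_e) {
        entry_free(e);
    }
}

/* With QLIST_FOREACH_FUNC_SAFE() the same loop is one line. */
static void drain_entries_with_macro(void)
{
    QLIST_FOREACH_FUNC_SAFE(&entries, next, entry_free);
}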

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-01-29 16:50 [PATCH 0/7] qcow2: compressed write cache Vladimir Sementsov-Ogievskiy
                   ` (7 preceding siblings ...)
  2021-01-29 17:30 ` [PATCH 0/7] qcow2: compressed write cache no-reply
@ 2021-02-09 13:25 ` Max Reitz
  2021-02-09 14:10   ` Vladimir Sementsov-Ogievskiy
  2021-02-10 12:35 ` Kevin Wolf
  9 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2021-02-09 13:25 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: kwolf, den, qemu-devel, armbru

On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> I know, I have several series waiting for a resend, but I had to switch
> to another task spawned from our customer's bug.
> 
> Original problem: we use O_DIRECT for all vm images in our product, it's
> the policy. The only exclusion is backup target qcow2 image for
> compressed backup, because compressed backup is extremely slow with
> O_DIRECT (due to unaligned writes). Customer complains that backup
> produces a lot of pagecache.
> 
> So we can either implement some internal cache or use fadvise somehow.
> Backup has several async workes, which writes simultaneously, so in both
> ways we have to track host cluster filling (before dropping the cache
> corresponding to the cluster).  So, if we have to track anyway, let's
> try to implement the cache.

I wanted to be excited here, because that sounds like it would be very 
easy to implement caching.  Like, just keep the cluster at 
free_byte_offset cached until the cluster it points to changes, then 
flush the cluster.

But then I see like 900 new lines of code, and I’m much less excited...

> Idea is simple: cache small unaligned write and flush the cluster when
> filled.
> 
> Performance result is very good (results in a table is time of
> compressed backup of 1000M disk filled with ones in seconds):

“Filled with ones” really is an edge case, though.

> ---------------  -----------  -----------
>                   backup(old)  backup(new)
> ssd:hdd(direct)  3e+02        4.4
>                                  -99%
> ssd:hdd(cached)  5.7          5.4
>                                  -5%
> ---------------  -----------  -----------
> 
> So, we have benefit even for cached mode! And the fastest thing is
> O_DIRECT with new implemented cache. So, I suggest to enable the new
> cache by default (which is done by the series).

First, I’m not sure how O_DIRECT really is relevant, because I don’t 
really see the point for writing compressed images.

Second, I find it a bit cheating if you say there is a huge improvement 
for the no-cache case, when actually, well, you just added a cache.  So 
the no-cache case just became faster because there is a cache now.

Well, I suppose I could follow that if O_DIRECT doesn’t make much sense 
for compressed images, qemu’s format drivers are free to introduce some 
caching (because technically the cache.direct option only applies to the 
protocol driver) for collecting compressed writes.  That conclusion 
makes both of my complaints kind of moot.

*shrug*

Third, what is the real-world impact on the page cache?  You described 
that that’s the reason why you need the cache in qemu, because otherwise 
the page cache is polluted too much.  How much is the difference really? 
  (I don’t know how good the compression ratio is for real-world images.)

Related to that, I remember a long time ago we had some discussion about 
letting qemu-img convert set a special cache mode for the target image 
that would make Linux drop everything before the last offset written 
(i.e., I suppose fadvise() with POSIX_FADV_SEQUENTIAL).  You discard 
that idea based on the fact that implementing a cache in qemu would be 
simple, but it isn’t, really.  What would the impact of 
POSIX_FADV_SEQUENTIAL be?  (One advantage of using that would be that we 
could reuse it for non-compressed images that are written by backup or 
qemu-img convert.)

(I don’t remember why that qemu-img discussion died back then.)


Fourth, regarding the code, would it be simpler if it were a pure write 
cache?  I.e., on read, everything is flushed, so we don’t have to deal 
with that.  I don’t think there are many valid cases where a compressed 
image is both written to and read from at the same time.  (Just asking, 
because I’d really want this code to be simpler.  I can imagine that 
reading from the cache is the least bit of complexity, but perhaps...)

Max



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-02-09 13:25 ` Max Reitz
@ 2021-02-09 14:10   ` Vladimir Sementsov-Ogievskiy
  2021-02-09 14:47     ` Max Reitz
  0 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-09 14:10 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, eblake, armbru, kwolf, den

09.02.2021 16:25, Max Reitz wrote:
> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>> Hi all!
>>
>> I know, I have several series waiting for a resend, but I had to switch
>> to another task spawned from our customer's bug.
>>
>> Original problem: we use O_DIRECT for all vm images in our product, it's
>> the policy. The only exclusion is backup target qcow2 image for
>> compressed backup, because compressed backup is extremely slow with
>> O_DIRECT (due to unaligned writes). Customer complains that backup
>> produces a lot of pagecache.
>>
>> So we can either implement some internal cache or use fadvise somehow.
>> Backup has several async workes, which writes simultaneously, so in both
>> ways we have to track host cluster filling (before dropping the cache
>> corresponding to the cluster).  So, if we have to track anyway, let's
>> try to implement the cache.
> 
> I wanted to be excited here, because that sounds like it would be very easy to implement caching.  Like, just keep the cluster at free_byte_offset cached until the cluster it points to changes, then flush the cluster.

The problem is that the chunks are written asynchronously; that's why this is not so easy.
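
Roughly, the state this implies is something like the following (made-up
names, not the actual code from patch 2): several async workers can have
chunks for different host clusters in flight at the same time, so a single
"current cluster" buffer is not enough; every partially filled cluster
needs its own buffer and fill counter.

#include <stdint.h>
#include "qemu/queue.h"

typedef struct CachedCluster {
    int64_t cluster_offset;    /* host offset of the cluster start */
    uint8_t *buf;              /* cluster_size bytes of pending data */
    uint64_t bytes_filled;     /* how much of the cluster is written so far */
    QTAILQ_ENTRY(CachedCluster) next;
} CachedCluster;

typedef struct CompressedWriteCache {
    uint64_t cluster_size;
    QTAILQ_HEAD(, CachedCluster) clusters; /* all partially filled clusters */
} CompressedWriteCache;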

> 
> But then I see like 900 new lines of code, and I’m much less excited...
> 
>> Idea is simple: cache small unaligned write and flush the cluster when
>> filled.
>>
>> Performance result is very good (results in a table is time of
>> compressed backup of 1000M disk filled with ones in seconds):
> 
> “Filled with ones” really is an edge case, though.

Yes, I think, all clusters are compressed to rather small chunks :)

> 
>> ---------------  -----------  -----------
>>                   backup(old)  backup(new)
>> ssd:hdd(direct)  3e+02        4.4
>>                                  -99%
>> ssd:hdd(cached)  5.7          5.4
>>                                  -5%
>> ---------------  -----------  -----------
>>
>> So, we have benefit even for cached mode! And the fastest thing is
>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>> cache by default (which is done by the series).
> 
> First, I’m not sure how O_DIRECT really is relevant, because I don’t really see the point for writing compressed images.

compressed backup is a point

> 
> Second, I find it a bit cheating if you say there is a huge improvement for the no-cache case, when actually, well, you just added a cache.  So the no-cache case just became faster because there is a cache now.

Still, the performance comparison is relevant: it shows that O_DIRECT as-is is unusable for compressed backup.

> 
> Well, I suppose I could follow that if O_DIRECT doesn’t make much sense for compressed images, qemu’s format drivers are free to introduce some caching (because technically the cache.direct option only applies to the protocol driver) for collecting compressed writes.

Yes I thought in this way, enabling the cache by default.

> That conclusion makes both of my complaints kind of moot.
> 
> *shrug*
> 
> Third, what is the real-world impact on the page cache?  You described that that’s the reason why you need the cache in qemu, because otherwise the page cache is polluted too much.  How much is the difference really?  (I don’t know how good the compression ratio is for real-world images.)

Hm, I don't know the ratio. The customer reported that most of RAM is polluted by Qemu's cache, and we use O_DIRECT for everything except the target of compressed backup. Still, the pollution may relate to several backups, and of course it is simple enough to drop the cache after each backup. But I think that even one backup of a 16T disk may pollute RAM enough.

> 
> Related to that, I remember a long time ago we had some discussion about letting qemu-img convert set a special cache mode for the target image that would make Linux drop everything before the last offset written (i.e., I suppose fadvise() with POSIX_FADV_SEQUENTIAL).  You discard that idea based on the fact that implementing a cache in qemu would be simple, but it isn’t, really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One advantage of using that would be that we could reuse it for non-compressed images that are written by backup or qemu-img convert.)

The problem is that writes are async, and therefore not sequential. So I have to track the writes and wait until the whole cluster is filled. It's simple to use fadvise as an option to my cache: instead of caching the data and writing it when the cluster is filled, we can mark the cluster POSIX_FADV_DONTNEED.
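
As a sketch, the fadvise variant would look something like this
(hypothetical helper, assuming the same per-cluster fill tracking):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>

/*
 * Called after each compressed chunk has landed on disk.  Once the host
 * cluster is completely filled, ask the kernel to drop its pages from the
 * page cache instead of keeping the data around.
 */
static void compressed_chunk_written(int fd, int64_t cluster_offset,
                                     uint64_t cluster_size,
                                     uint64_t *bytes_filled, uint64_t nbytes)
{
    *bytes_filled += nbytes;

    if (*bytes_filled == cluster_size) {
        posix_fadvise(fd, cluster_offset, cluster_size, POSIX_FADV_DONTNEED);
    }
}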

> 
> (I don’t remember why that qemu-img discussion died back then.)
> 
> 
> Fourth, regarding the code, would it be simpler if it were a pure write cache?  I.e., on read, everything is flushed, so we don’t have to deal with that.  I don’t think there are many valid cases where a compressed image is both written to and read from at the same time.  (Just asking, because I’d really want this code to be simpler.  I can imagine that reading from the cache is the least bit of complexity, but perhaps...)
> 

Hm. I really didn't want to support reads, and did it only to make it possible to enable the cache by default. Still, the read function is really simple, and I don't think that dropping it would simplify the code significantly.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-02-09 14:10   ` Vladimir Sementsov-Ogievskiy
@ 2021-02-09 14:47     ` Max Reitz
  2021-02-09 16:39       ` Vladimir Sementsov-Ogievskiy
  2021-02-09 16:52       ` Denis V. Lunev
  0 siblings, 2 replies; 32+ messages in thread
From: Max Reitz @ 2021-02-09 14:47 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: kwolf, den, qemu-devel, armbru

On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
> 09.02.2021 16:25, Max Reitz wrote:
>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> I know, I have several series waiting for a resend, but I had to switch
>>> to another task spawned from our customer's bug.
>>>
>>> Original problem: we use O_DIRECT for all vm images in our product, it's
>>> the policy. The only exclusion is backup target qcow2 image for
>>> compressed backup, because compressed backup is extremely slow with
>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>> produces a lot of pagecache.
>>>
>>> So we can either implement some internal cache or use fadvise somehow.
>>> Backup has several async workes, which writes simultaneously, so in both
>>> ways we have to track host cluster filling (before dropping the cache
>>> corresponding to the cluster).  So, if we have to track anyway, let's
>>> try to implement the cache.
>>
>> I wanted to be excited here, because that sounds like it would be very 
>> easy to implement caching.  Like, just keep the cluster at 
>> free_byte_offset cached until the cluster it points to changes, then 
>> flush the cluster.
> 
> The problem is that chunks are written asynchronously.. That's why this 
> all is not so easy.
> 
>>
>> But then I see like 900 new lines of code, and I’m much less excited...
>>
>>> Idea is simple: cache small unaligned write and flush the cluster when
>>> filled.
>>>
>>> Performance result is very good (results in a table is time of
>>> compressed backup of 1000M disk filled with ones in seconds):
>>
>> “Filled with ones” really is an edge case, though.
> 
> Yes, I think, all clusters are compressed to rather small chunks :)
> 
>>
>>> ---------------  -----------  -----------
>>>                   backup(old)  backup(new)
>>> ssd:hdd(direct)  3e+02        4.4
>>>                                  -99%
>>> ssd:hdd(cached)  5.7          5.4
>>>                                  -5%
>>> ---------------  -----------  -----------
>>>
>>> So, we have benefit even for cached mode! And the fastest thing is
>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>> cache by default (which is done by the series).
>>
>> First, I’m not sure how O_DIRECT really is relevant, because I don’t 
>> really see the point for writing compressed images.
> 
> compressed backup is a point

(Perhaps irrelevant, but just to be clear:) I meant the point of using 
O_DIRECT, which one can decide to not use for backup targets (as you 
have done already).

>> Second, I find it a bit cheating if you say there is a huge 
>> improvement for the no-cache case, when actually, well, you just added 
>> a cache.  So the no-cache case just became faster because there is a 
>> cache now.
> 
> Still, performance comparison is relevant to show that O_DIRECT as is 
> unusable for compressed backup.

(Again, perhaps irrelevant, but:) Yes, but my first point was exactly 
whether O_DIRECT is even relevant for writing compressed images.

>> Well, I suppose I could follow that if O_DIRECT doesn’t make much 
>> sense for compressed images, qemu’s format drivers are free to 
>> introduce some caching (because technically the cache.direct option 
>> only applies to the protocol driver) for collecting compressed writes.
> 
> Yes I thought in this way, enabling the cache by default.
> 
>> That conclusion makes both of my complaints kind of moot.
>>
>> *shrug*
>>
>> Third, what is the real-world impact on the page cache?  You described 
>> that that’s the reason why you need the cache in qemu, because 
>> otherwise the page cache is polluted too much.  How much is the 
>> difference really?  (I don’t know how good the compression ratio is 
>> for real-world images.)
> 
> Hm. I don't know the ratio.. Customer reported that most of RAM is 
> polluted by Qemu's cache, and we use O_DIRECT for everything except for 
> target of compressed backup.. Still the pollution may relate to several 
> backups and of course it is simple enough to drop the cache after each 
> backup. But I think that even one backup of 16T disk may pollute RAM 
> enough.

Oh, sorry, I just realized I had a brain fart there.  I was referring to 
whether this series improves the page cache pollution.  But obviously it 
will if it allows you to re-enable O_DIRECT.

>> Related to that, I remember a long time ago we had some discussion 
>> about letting qemu-img convert set a special cache mode for the target 
>> image that would make Linux drop everything before the last offset 
>> written (i.e., I suppose fadvise() with POSIX_FADV_SEQUENTIAL).  You 
>> discard that idea based on the fact that implementing a cache in qemu 
>> would be simple, but it isn’t, really.  What would the impact of 
>> POSIX_FADV_SEQUENTIAL be?  (One advantage of using that would be that 
>> we could reuse it for non-compressed images that are written by backup 
>> or qemu-img convert.)
> 
> The problem is that writes are async. And therefore, not sequential.

In theory, yes, but all compressed writes still go through 
qcow2_alloc_bytes() right before submitting the write, so I wonder 
whether in practice the writes aren’t usually sufficiently sequential to 
make POSIX_FADV_SEQUENTIAL work fine.

> So
> I have to track the writes and wait until the whole cluster is filled. 
> It's simple use fadvise as an option to my cache: instead of caching 
> data and write when cluster is filled we can instead mark cluster 
> POSIX_FADV_DONTNEED.
> 
>>
>> (I don’t remember why that qemu-img discussion died back then.)
>>
>>
>> Fourth, regarding the code, would it be simpler if it were a pure 
>> write cache?  I.e., on read, everything is flushed, so we don’t have 
>> to deal with that.  I don’t think there are many valid cases where a 
>> compressed image is both written to and read from at the same time.  
>> (Just asking, because I’d really want this code to be simpler.  I can 
>> imagine that reading from the cache is the least bit of complexity, 
>> but perhaps...)
>>
> 
> Hm. I really didn't want to support reads, and do it only to make it 
> possible to enable the cache by default.. Still read function is really 
> simple, and I don't think that dropping it will simplify the code 
> significantly.

That’s too bad.

So the only question I have left is what POSIX_FADV_SEQUENTIAL actually 
would do in practice.

(But even then, the premise just doesn’t motivate me sufficiently yet...)

Max



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-02-09 14:47     ` Max Reitz
@ 2021-02-09 16:39       ` Vladimir Sementsov-Ogievskiy
  2021-02-09 18:36         ` Vladimir Sementsov-Ogievskiy
  2021-02-09 16:52       ` Denis V. Lunev
  1 sibling, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-09 16:39 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, eblake, armbru, kwolf, den

09.02.2021 17:47, Max Reitz wrote:
> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 16:25, Max Reitz wrote:
>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> I know, I have several series waiting for a resend, but I had to switch
>>>> to another task spawned from our customer's bug.
>>>>
>>>> Original problem: we use O_DIRECT for all vm images in our product, it's
>>>> the policy. The only exclusion is backup target qcow2 image for
>>>> compressed backup, because compressed backup is extremely slow with
>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>> produces a lot of pagecache.
>>>>
>>>> So we can either implement some internal cache or use fadvise somehow.
>>>> Backup has several async workes, which writes simultaneously, so in both
>>>> ways we have to track host cluster filling (before dropping the cache
>>>> corresponding to the cluster).  So, if we have to track anyway, let's
>>>> try to implement the cache.
>>>
>>> I wanted to be excited here, because that sounds like it would be very easy to implement caching.  Like, just keep the cluster at free_byte_offset cached until the cluster it points to changes, then flush the cluster.
>>
>> The problem is that chunks are written asynchronously.. That's why this all is not so easy.
>>
>>>
>>> But then I see like 900 new lines of code, and I’m much less excited...
>>>
>>>> Idea is simple: cache small unaligned write and flush the cluster when
>>>> filled.
>>>>
>>>> Performance result is very good (results in a table is time of
>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>
>>> “Filled with ones” really is an edge case, though.
>>
>> Yes, I think, all clusters are compressed to rather small chunks :)
>>
>>>
>>>> ---------------  -----------  -----------
>>>>                   backup(old)  backup(new)
>>>> ssd:hdd(direct)  3e+02        4.4
>>>>                                  -99%
>>>> ssd:hdd(cached)  5.7          5.4
>>>>                                  -5%
>>>> ---------------  -----------  -----------
>>>>
>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>>> cache by default (which is done by the series).
>>>
>>> First, I’m not sure how O_DIRECT really is relevant, because I don’t really see the point for writing compressed images.
>>
>> compressed backup is a point
> 
> (Perhaps irrelevant, but just to be clear:) I meant the point of using O_DIRECT, which one can decide to not use for backup targets (as you have done already).
> 
>>> Second, I find it a bit cheating if you say there is a huge improvement for the no-cache case, when actually, well, you just added a cache.  So the no-cache case just became faster because there is a cache now.
>>
>> Still, performance comparison is relevant to show that O_DIRECT as is unusable for compressed backup.
> 
> (Again, perhaps irrelevant, but:) Yes, but my first point was exactly whether O_DIRECT is even relevant for writing compressed images.
> 
>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much sense for compressed images, qemu’s format drivers are free to introduce some caching (because technically the cache.direct option only applies to the protocol driver) for collecting compressed writes.
>>
>> Yes I thought in this way, enabling the cache by default.
>>
>>> That conclusion makes both of my complaints kind of moot.
>>>
>>> *shrug*
>>>
>>> Third, what is the real-world impact on the page cache?  You described that that’s the reason why you need the cache in qemu, because otherwise the page cache is polluted too much.  How much is the difference really?  (I don’t know how good the compression ratio is for real-world images.)
>>
>> Hm. I don't know the ratio.. Customer reported that most of RAM is polluted by Qemu's cache, and we use O_DIRECT for everything except for target of compressed backup.. Still the pollution may relate to several backups and of course it is simple enough to drop the cache after each backup. But I think that even one backup of 16T disk may pollute RAM enough.
> 
> Oh, sorry, I just realized I had a brain fart there.  I was referring to whether this series improves the page cache pollution.  But obviously it will if it allows you to re-enable O_DIRECT.
> 
>>> Related to that, I remember a long time ago we had some discussion about letting qemu-img convert set a special cache mode for the target image that would make Linux drop everything before the last offset written (i.e., I suppose fadvise() with POSIX_FADV_SEQUENTIAL).  You discard that idea based on the fact that implementing a cache in qemu would be simple, but it isn’t, really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One advantage of using that would be that we could reuse it for non-compressed images that are written by backup or qemu-img convert.)
>>
>> The problem is that writes are async. And therefore, not sequential.
> 
> In theory, yes, but all compressed writes still goes through qcow2_alloc_bytes() right before submitting the write, so I wonder whether in practice the writes aren’t usually sufficiently sequential to make POSIX_FADV_SEQUENTIAL work fine.

Yes, allocation is sequential, but the writes are not. Reasonable, I should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for the whole backup target before the backup? Will try. Still, I expect that my cache will show better performance anyway. Actually, comparing cached (by pagecache) vs my cache, we have 5.7 -> 4.4, i.e. about 20% faster, which is significant (still, yes, it would be good to check it on a more realistic case than all-ones).
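
(For that experiment it should be a single call on the target file
descriptor before the backup starts; a sketch with a placeholder fd,
where len == 0 means "from offset to the end of the file":)

#define _GNU_SOURCE
#include <fcntl.h>

static void advise_sequential(int target_fd)
{
    /* Advise sequential access for the whole backup target file. */
    posix_fadvise(target_fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}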

> 
>> So
>> I have to track the writes and wait until the whole cluster is filled. It's simple use fadvise as an option to my cache: instead of caching data and write when cluster is filled we can instead mark cluster POSIX_FADV_DONTNEED.
>>
>>>
>>> (I don’t remember why that qemu-img discussion died back then.)
>>>
>>>
>>> Fourth, regarding the code, would it be simpler if it were a pure write cache?  I.e., on read, everything is flushed, so we don’t have to deal with that.  I don’t think there are many valid cases where a compressed image is both written to and read from at the same time. (Just asking, because I’d really want this code to be simpler.  I can imagine that reading from the cache is the least bit of complexity, but perhaps...)
>>>
>>
>> Hm. I really didn't want to support reads, and do it only to make it possible to enable the cache by default.. Still read function is really simple, and I don't think that dropping it will simplify the code significantly.
> 
> That’s too bad.
> 
> So the only question I have left is what POSIX_FADV_SEQUENTIAL actually would do in practice.

will check.

> 
> (But even then, the premise just doesn’t motivate me sufficiently yet...)
> 




-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-02-09 14:47     ` Max Reitz
  2021-02-09 16:39       ` Vladimir Sementsov-Ogievskiy
@ 2021-02-09 16:52       ` Denis V. Lunev
  2021-02-10 10:00         ` Max Reitz
  1 sibling, 1 reply; 32+ messages in thread
From: Denis V. Lunev @ 2021-02-09 16:52 UTC (permalink / raw)
  To: Max Reitz, Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, qemu-devel, armbru

On 2/9/21 5:47 PM, Max Reitz wrote:
> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 16:25, Max Reitz wrote:
>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> I know, I have several series waiting for a resend, but I had to
>>>> switch
>>>> to another task spawned from our customer's bug.
>>>>
>>>> Original problem: we use O_DIRECT for all vm images in our product,
>>>> it's
>>>> the policy. The only exclusion is backup target qcow2 image for
>>>> compressed backup, because compressed backup is extremely slow with
>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>> produces a lot of pagecache.
>>>>
>>>> So we can either implement some internal cache or use fadvise somehow.
>>>> Backup has several async workes, which writes simultaneously, so in
>>>> both
>>>> ways we have to track host cluster filling (before dropping the cache
>>>> corresponding to the cluster).  So, if we have to track anyway, let's
>>>> try to implement the cache.
>>>
>>> I wanted to be excited here, because that sounds like it would be
>>> very easy to implement caching.  Like, just keep the cluster at
>>> free_byte_offset cached until the cluster it points to changes, then
>>> flush the cluster.
>>
>> The problem is that chunks are written asynchronously.. That's why
>> this all is not so easy.
>>
>>>
>>> But then I see like 900 new lines of code, and I’m much less excited...
>>>
>>>> Idea is simple: cache small unaligned write and flush the cluster when
>>>> filled.
>>>>
>>>> Performance result is very good (results in a table is time of
>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>
>>> “Filled with ones” really is an edge case, though.
>>
>> Yes, I think, all clusters are compressed to rather small chunks :)
>>
>>>
>>>> ---------------  -----------  -----------
>>>>                   backup(old)  backup(new)
>>>> ssd:hdd(direct)  3e+02        4.4
>>>>                                  -99%
>>>> ssd:hdd(cached)  5.7          5.4
>>>>                                  -5%
>>>> ---------------  -----------  -----------
>>>>
>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>>> cache by default (which is done by the series).
>>>
>>> First, I’m not sure how O_DIRECT really is relevant, because I don’t
>>> really see the point for writing compressed images.
>>
>> compressed backup is a point
>
> (Perhaps irrelevant, but just to be clear:) I meant the point of using
> O_DIRECT, which one can decide to not use for backup targets (as you
> have done already).
>
>>> Second, I find it a bit cheating if you say there is a huge
>>> improvement for the no-cache case, when actually, well, you just
>>> added a cache.  So the no-cache case just became faster because
>>> there is a cache now.
>>
>> Still, performance comparison is relevant to show that O_DIRECT as is
>> unusable for compressed backup.
>
> (Again, perhaps irrelevant, but:) Yes, but my first point was exactly
> whether O_DIRECT is even relevant for writing compressed images.
>
>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>> sense for compressed images, qemu’s format drivers are free to
>>> introduce some caching (because technically the cache.direct option
>>> only applies to the protocol driver) for collecting compressed writes.
>>
>> Yes I thought in this way, enabling the cache by default.
>>
>>> That conclusion makes both of my complaints kind of moot.
>>>
>>> *shrug*
>>>
>>> Third, what is the real-world impact on the page cache?  You
>>> described that that’s the reason why you need the cache in qemu,
>>> because otherwise the page cache is polluted too much.  How much is
>>> the difference really?  (I don’t know how good the compression ratio
>>> is for real-world images.)
>>
>> Hm. I don't know the ratio.. Customer reported that most of RAM is
>> polluted by Qemu's cache, and we use O_DIRECT for everything except
>> for target of compressed backup.. Still the pollution may relate to
>> several backups and of course it is simple enough to drop the cache
>> after each backup. But I think that even one backup of 16T disk may
>> pollute RAM enough.
>
> Oh, sorry, I just realized I had a brain fart there.  I was referring
> to whether this series improves the page cache pollution.  But
> obviously it will if it allows you to re-enable O_DIRECT.
>
>>> Related to that, I remember a long time ago we had some discussion
>>> about letting qemu-img convert set a special cache mode for the
>>> target image that would make Linux drop everything before the last
>>> offset written (i.e., I suppose fadvise() with
>>> POSIX_FADV_SEQUENTIAL).  You discard that idea based on the fact
>>> that implementing a cache in qemu would be simple, but it isn’t,
>>> really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One
>>> advantage of using that would be that we could reuse it for
>>> non-compressed images that are written by backup or qemu-img convert.)
>>
>> The problem is that writes are async. And therefore, not sequential.
>
> In theory, yes, but all compressed writes still goes through
> qcow2_alloc_bytes() right before submitting the write, so I wonder
> whether in practice the writes aren’t usually sufficiently sequential
> to make POSIX_FADV_SEQUENTIAL work fine.
>
>> So
>> I have to track the writes and wait until the whole cluster is
>> filled. It's simple use fadvise as an option to my cache: instead of
>> caching data and write when cluster is filled we can instead mark
>> cluster POSIX_FADV_DONTNEED.
>>
>>>
>>> (I don’t remember why that qemu-img discussion died back then.)
>>>
>>>
>>> Fourth, regarding the code, would it be simpler if it were a pure
>>> write cache?  I.e., on read, everything is flushed, so we don’t have
>>> to deal with that.  I don’t think there are many valid cases where a
>>> compressed image is both written to and read from at the same time. 
>>> (Just asking, because I’d really want this code to be simpler.  I
>>> can imagine that reading from the cache is the least bit of
>>> complexity, but perhaps...)
>>>
>>
>> Hm. I really didn't want to support reads, and do it only to make it
>> possible to enable the cache by default.. Still read function is
>> really simple, and I don't think that dropping it will simplify the
>> code significantly.
>
> That’s too bad.
>
> So the only question I have left is what POSIX_FADV_SEQUENTIAL
> actually would do in practice.
>
> (But even then, the premise just doesn’t motivate me sufficiently yet...)
>
POSIX_FADV_SEQUENTIAL influences the amount of the read-ahead
only.

int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
{
.....
    case POSIX_FADV_NORMAL:
        file->f_ra.ra_pages = bdi->ra_pages;
        spin_lock(&file->f_lock);
        file->f_mode &= ~FMODE_RANDOM;
        spin_unlock(&file->f_lock);
        break;
    case POSIX_FADV_SEQUENTIAL:
        file->f_ra.ra_pages = bdi->ra_pages * 2;
        spin_lock(&file->f_lock);
        file->f_mode &= ~FMODE_RANDOM;
        spin_unlock(&file->f_lock);
        break;

thus the only difference from the usual NORMAL mode would be
doubled amount of read-ahead pages.

Den


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-02-09 16:39       ` Vladimir Sementsov-Ogievskiy
@ 2021-02-09 18:36         ` Vladimir Sementsov-Ogievskiy
  2021-02-09 18:41           ` Denis V. Lunev
  0 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-09 18:36 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, eblake, armbru, kwolf, den

09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
> 09.02.2021 17:47, Max Reitz wrote:
>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.02.2021 16:25, Max Reitz wrote:
>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Hi all!
>>>>>
>>>>> I know, I have several series waiting for a resend, but I had to switch
>>>>> to another task spawned from our customer's bug.
>>>>>
>>>>> Original problem: we use O_DIRECT for all vm images in our product, it's
>>>>> the policy. The only exclusion is backup target qcow2 image for
>>>>> compressed backup, because compressed backup is extremely slow with
>>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>>> produces a lot of pagecache.
>>>>>
>>>>> So we can either implement some internal cache or use fadvise somehow.
>>>>> Backup has several async workes, which writes simultaneously, so in both
>>>>> ways we have to track host cluster filling (before dropping the cache
>>>>> corresponding to the cluster).  So, if we have to track anyway, let's
>>>>> try to implement the cache.
>>>>
>>>> I wanted to be excited here, because that sounds like it would be very easy to implement caching.  Like, just keep the cluster at free_byte_offset cached until the cluster it points to changes, then flush the cluster.
>>>
>>> The problem is that chunks are written asynchronously.. That's why this all is not so easy.
>>>
>>>>
>>>> But then I see like 900 new lines of code, and I’m much less excited...
>>>>
>>>>> Idea is simple: cache small unaligned write and flush the cluster when
>>>>> filled.
>>>>>
>>>>> Performance result is very good (results in a table is time of
>>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>>
>>>> “Filled with ones” really is an edge case, though.
>>>
>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>
>>>>
>>>>> ---------------  -----------  -----------
>>>>>                   backup(old)  backup(new)
>>>>> ssd:hdd(direct)  3e+02        4.4
>>>>>                                  -99%
>>>>> ssd:hdd(cached)  5.7          5.4
>>>>>                                  -5%
>>>>> ---------------  -----------  -----------
>>>>>
>>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>>>> cache by default (which is done by the series).
>>>>
>>>> First, I’m not sure how O_DIRECT really is relevant, because I don’t really see the point for writing compressed images.
>>>
>>> compressed backup is a point
>>
>> (Perhaps irrelevant, but just to be clear:) I meant the point of using O_DIRECT, which one can decide to not use for backup targets (as you have done already).
>>
>>>> Second, I find it a bit cheating if you say there is a huge improvement for the no-cache case, when actually, well, you just added a cache.  So the no-cache case just became faster because there is a cache now.
>>>
>>> Still, performance comparison is relevant to show that O_DIRECT as is unusable for compressed backup.
>>
>> (Again, perhaps irrelevant, but:) Yes, but my first point was exactly whether O_DIRECT is even relevant for writing compressed images.
>>
>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much sense for compressed images, qemu’s format drivers are free to introduce some caching (because technically the cache.direct option only applies to the protocol driver) for collecting compressed writes.
>>>
>>> Yes I thought in this way, enabling the cache by default.
>>>
>>>> That conclusion makes both of my complaints kind of moot.
>>>>
>>>> *shrug*
>>>>
>>>> Third, what is the real-world impact on the page cache?  You described that that’s the reason why you need the cache in qemu, because otherwise the page cache is polluted too much.  How much is the difference really?  (I don’t know how good the compression ratio is for real-world images.)
>>>
>>> Hm. I don't know the ratio.. Customer reported that most of RAM is polluted by Qemu's cache, and we use O_DIRECT for everything except for target of compressed backup.. Still the pollution may relate to several backups and of course it is simple enough to drop the cache after each backup. But I think that even one backup of 16T disk may pollute RAM enough.
>>
>> Oh, sorry, I just realized I had a brain fart there.  I was referring to whether this series improves the page cache pollution.  But obviously it will if it allows you to re-enable O_DIRECT.
>>
>>>> Related to that, I remember a long time ago we had some discussion about letting qemu-img convert set a special cache mode for the target image that would make Linux drop everything before the last offset written (i.e., I suppose fadvise() with POSIX_FADV_SEQUENTIAL).  You discard that idea based on the fact that implementing a cache in qemu would be simple, but it isn’t, really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One advantage of using that would be that we could reuse it for non-compressed images that are written by backup or qemu-img convert.)
>>>
>>> The problem is that writes are async. And therefore, not sequential.
>>
>> In theory, yes, but all compressed writes still goes through qcow2_alloc_bytes() right before submitting the write, so I wonder whether in practice the writes aren’t usually sufficiently sequential to make POSIX_FADV_SEQUENTIAL work fine.
> 
> Yes, allocation is sequential. But writes are not.. Reasonable, I should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for the whole backup target before the backup? Will try. Still, I expect that my cache will show better performance anyway. Actually, comparing cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e. 20% faster, which is significant (still, yes, would be good to check it on more real case than all-ones).
> 
>>
>>> So
>>> I have to track the writes and wait until the whole cluster is filled. It's simple use fadvise as an option to my cache: instead of caching data and write when cluster is filled we can instead mark cluster POSIX_FADV_DONTNEED.
>>>
>>>>
>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>
>>>>
>>>> Fourth, regarding the code, would it be simpler if it were a pure write cache?  I.e., on read, everything is flushed, so we don’t have to deal with that.  I don’t think there are many valid cases where a compressed image is both written to and read from at the same time. (Just asking, because I’d really want this code to be simpler.  I can imagine that reading from the cache is the least bit of complexity, but perhaps...)
>>>>
>>>
>>> Hm. I really didn't want to support reads, and do it only to make it possible to enable the cache by default.. Still read function is really simple, and I don't think that dropping it will simplify the code significantly.
>>
>> That’s too bad.
>>
>> So the only question I have left is what POSIX_FADV_SEQUENTIAL actually would do in practice.
> 
> will check.
> 

I checked: if I mark the whole file with FADV_SEQUENTIAL, the page cache is not dropped.

Test:
[root@kvm fadvise]# cat a.c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <getopt.h>
#include <string.h>
#include <stdbool.h>

int main(int argc, char *argv[])
{
     int fd;
     int i;
     char mb[1024 * 1024];
     int open_flags = O_RDWR | O_CREAT | O_EXCL;
     int fadv_flags = 0;
     int fadv_final_flags = 0;
     int len = 1024 * 1024;
     bool do_fsync = false;

     for (i = 1; i < argc - 1; i++) {
         const char *arg = argv[i];

         if (!strcmp(arg, "direct")) {
             open_flags |= O_DIRECT;
         } else if (!strcmp(arg, "seq")) {
             fadv_flags = POSIX_FADV_SEQUENTIAL;
         } else if (!strcmp(arg, "dontneed")) {
             fadv_flags = POSIX_FADV_DONTNEED;
         } else if (!strcmp(arg, "final-dontneed")) {
             fadv_final_flags = POSIX_FADV_DONTNEED;
         } else if (!strcmp(arg, "fsync")) {
             do_fsync = true;
         } else {
             fprintf(stderr, "unknown: %s\n", arg);
             return 1;
         }
     }

     fd = open(argv[argc - 1], open_flags, 0644);

     if (fd < 0) {
         fprintf(stderr, "failed to open\n");
         return 1;
     }

     if (fadv_flags) {
         posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
     }
     for (i = 0; i < 100; i++) {
         write(fd, mb, len);
     }
     if (fadv_final_flags) {
         posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
     }
     if (do_fsync) {
         fsync(fd);
     }

     close(fd);
}



[root@kvm fadvise]# gcc a.c
[root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
   RES PAGES  SIZE FILE
  100M 25600  100M x
[root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
   RES PAGES  SIZE FILE
  100M 25600  100M x
[root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
   RES PAGES  SIZE FILE
   36M  9216  100M x
[root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
   RES PAGES  SIZE FILE
  100M 25600  100M x
[root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
   RES PAGES  SIZE FILE
  100M 25600  100M x
[root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
   RES PAGES  SIZE FILE
   36M  9216  100M x
[root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
RES PAGES SIZE FILE
  0B     0   0B x



Backup-generated pagecache is essentially trash: it will never be used. And it's bad that it can displace other, useful pagecache. So the best thing for backup is direct mode + the proposed cache.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-02-09 18:36         ` Vladimir Sementsov-Ogievskiy
@ 2021-02-09 18:41           ` Denis V. Lunev
  2021-02-09 18:51             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 32+ messages in thread
From: Denis V. Lunev @ 2021-02-09 18:41 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Max Reitz, qemu-block
  Cc: kwolf, qemu-devel, armbru

On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:
> 09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 17:47, Max Reitz wrote:
>>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>>> 09.02.2021 16:25, Max Reitz wrote:
>>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> Hi all!
>>>>>>
>>>>>> I know, I have several series waiting for a resend, but I had to
>>>>>> switch
>>>>>> to another task spawned from our customer's bug.
>>>>>>
>>>>>> Original problem: we use O_DIRECT for all vm images in our
>>>>>> product, it's
>>>>>> the policy. The only exclusion is backup target qcow2 image for
>>>>>> compressed backup, because compressed backup is extremely slow with
>>>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>>>> produces a lot of pagecache.
>>>>>>
>>>>>> So we can either implement some internal cache or use fadvise
>>>>>> somehow.
>>>>>> Backup has several async workes, which writes simultaneously, so
>>>>>> in both
>>>>>> ways we have to track host cluster filling (before dropping the
>>>>>> cache
>>>>>> corresponding to the cluster).  So, if we have to track anyway,
>>>>>> let's
>>>>>> try to implement the cache.
>>>>>
>>>>> I wanted to be excited here, because that sounds like it would be
>>>>> very easy to implement caching.  Like, just keep the cluster at
>>>>> free_byte_offset cached until the cluster it points to changes,
>>>>> then flush the cluster.
>>>>
>>>> The problem is that chunks are written asynchronously.. That's why
>>>> this all is not so easy.
>>>>
>>>>>
>>>>> But then I see like 900 new lines of code, and I’m much less
>>>>> excited...
>>>>>
>>>>>> Idea is simple: cache small unaligned write and flush the cluster
>>>>>> when
>>>>>> filled.
>>>>>>
>>>>>> Performance result is very good (results in a table is time of
>>>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>>>
>>>>> “Filled with ones” really is an edge case, though.
>>>>
>>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>>
>>>>>
>>>>>> ---------------  -----------  -----------
>>>>>>                   backup(old)  backup(new)
>>>>>> ssd:hdd(direct)  3e+02        4.4
>>>>>>                                  -99%
>>>>>> ssd:hdd(cached)  5.7          5.4
>>>>>>                                  -5%
>>>>>> ---------------  -----------  -----------
>>>>>>
>>>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>>>>> cache by default (which is done by the series).
>>>>>
>>>>> First, I’m not sure how O_DIRECT really is relevant, because I
>>>>> don’t really see the point for writing compressed images.
>>>>
>>>> compressed backup is a point
>>>
>>> (Perhaps irrelevant, but just to be clear:) I meant the point of
>>> using O_DIRECT, which one can decide to not use for backup targets
>>> (as you have done already).
>>>
>>>>> Second, I find it a bit cheating if you say there is a huge
>>>>> improvement for the no-cache case, when actually, well, you just
>>>>> added a cache.  So the no-cache case just became faster because
>>>>> there is a cache now.
>>>>
>>>> Still, performance comparison is relevant to show that O_DIRECT as
>>>> is unusable for compressed backup.
>>>
>>> (Again, perhaps irrelevant, but:) Yes, but my first point was
>>> exactly whether O_DIRECT is even relevant for writing compressed
>>> images.
>>>
>>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>>>> sense for compressed images, qemu’s format drivers are free to
>>>>> introduce some caching (because technically the cache.direct
>>>>> option only applies to the protocol driver) for collecting
>>>>> compressed writes.
>>>>
>>>> Yes I thought in this way, enabling the cache by default.
>>>>
>>>>> That conclusion makes both of my complaints kind of moot.
>>>>>
>>>>> *shrug*
>>>>>
>>>>> Third, what is the real-world impact on the page cache?  You
>>>>> described that that’s the reason why you need the cache in qemu,
>>>>> because otherwise the page cache is polluted too much.  How much
>>>>> is the difference really?  (I don’t know how good the compression
>>>>> ratio is for real-world images.)
>>>>
>>>> Hm. I don't know the ratio.. Customer reported that most of RAM is
>>>> polluted by Qemu's cache, and we use O_DIRECT for everything except
>>>> for target of compressed backup.. Still the pollution may relate to
>>>> several backups and of course it is simple enough to drop the cache
>>>> after each backup. But I think that even one backup of 16T disk may
>>>> pollute RAM enough.
>>>
>>> Oh, sorry, I just realized I had a brain fart there.  I was
>>> referring to whether this series improves the page cache pollution. 
>>> But obviously it will if it allows you to re-enable O_DIRECT.
>>>
>>>>> Related to that, I remember a long time ago we had some discussion
>>>>> about letting qemu-img convert set a special cache mode for the
>>>>> target image that would make Linux drop everything before the last
>>>>> offset written (i.e., I suppose fadvise() with
>>>>> POSIX_FADV_SEQUENTIAL).  You discard that idea based on the fact
>>>>> that implementing a cache in qemu would be simple, but it isn’t,
>>>>> really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One
>>>>> advantage of using that would be that we could reuse it for
>>>>> non-compressed images that are written by backup or qemu-img
>>>>> convert.)
>>>>
>>>> The problem is that writes are async. And therefore, not sequential.
>>>
>>> In theory, yes, but all compressed writes still goes through
>>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>>> whether in practice the writes aren’t usually sufficiently
>>> sequential to make POSIX_FADV_SEQUENTIAL work fine.
>>
>> Yes, allocation is sequential. But writes are not.. Reasonable, I
>> should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for
>> the whole backup target before the backup? Will try. Still, I expect
>> that my cache will show better performance anyway. Actually,
>> comparing cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e.
>> 20% faster, which is significant (still, yes, would be good to check
>> it on more real case than all-ones).
>>
>>>
>>>> So
>>>> I have to track the writes and wait until the whole cluster is
>>>> filled. It's simple use fadvise as an option to my cache: instead
>>>> of caching data and write when cluster is filled we can instead
>>>> mark cluster POSIX_FADV_DONTNEED.
>>>>
>>>>>
>>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>>
>>>>>
>>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>>> write cache?  I.e., on read, everything is flushed, so we don’t
>>>>> have to deal with that.  I don’t think there are many valid cases
>>>>> where a compressed image is both written to and read from at the
>>>>> same time. (Just asking, because I’d really want this code to be
>>>>> simpler.  I can imagine that reading from the cache is the least
>>>>> bit of complexity, but perhaps...)
>>>>>
>>>>
>>>> Hm. I really didn't want to support reads, and do it only to make
>>>> it possible to enable the cache by default.. Still read function is
>>>> really simple, and I don't think that dropping it will simplify the
>>>> code significantly.
>>>
>>> That’s too bad.
>>>
>>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>>> actually would do in practice.
>>
>> will check.
>>
>
> Checked that if I mark the whole file by FADV_SEQUENTIAL, cache is not
> removed.
>
> Test:
> [root@kvm fadvise]# cat a.c
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <getopt.h>
> #include <string.h>
> #include <stdbool.h>
>
> int main(int argc, char *argv[])
> {
>     int fd;
>     int i;
>     char mb[1024 * 1024];
>     int open_flags = O_RDWR | O_CREAT | O_EXCL;
>     int fadv_flags = 0;
>     int fadv_final_flags = 0;
>     int len = 1024 * 1024;
>     bool do_fsync = false;
>
>     for (i = 1; i < argc - 1; i++) {
>         const char *arg = argv[i];
>
>         if (!strcmp(arg, "direct")) {
>             open_flags |= O_DIRECT;
>         } else if (!strcmp(arg, "seq")) {
>             fadv_flags = POSIX_FADV_SEQUENTIAL;
>         } else if (!strcmp(arg, "dontneed")) {
>             fadv_flags = POSIX_FADV_DONTNEED;
>         } else if (!strcmp(arg, "final-dontneed")) {
>             fadv_final_flags = POSIX_FADV_DONTNEED;
>         } else if (!strcmp(arg, "fsync")) {
>             do_fsync = true;
>         } else {
>             fprintf(stderr, "unknown: %s\n", arg);
>             return 1;
>         }
>     }
>
>     fd = open(argv[argc - 1], open_flags);
>
>     if (fd < 0) {
>         fprintf(stderr, "failed to open\n");
>         return 1;
>     }
>
>     if (fadv_flags) {
>         posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
>     }
>     for (i = 0; i < 100; i++) {
>         write(fd, mb, len);
>     }
>     if (fadv_final_flags) {
>         posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
>     }
>     if (do_fsync) {
>         fsync(fd);
>     }
>
>     close(fd);
> }
>
>
>
> [root@kvm fadvise]# gcc a.c
> [root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
>   RES PAGES  SIZE FILE
>  100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
>   RES PAGES  SIZE FILE
>  100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
>   RES PAGES  SIZE FILE
>   36M  9216  100M x
> [root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
>   RES PAGES  SIZE FILE
>  100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
>   RES PAGES  SIZE FILE
>  100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
>   RES PAGES  SIZE FILE
>   36M  9216  100M x
> [root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
> RES PAGES SIZE FILE
>  0B     0   0B x
>
>
>
> Backup-generated pagecache is a formal trash, it will be never used.
> And it's bad that it can displace another good pagecache. So the best
> thing for backup is direct mode + proposed cache.
>
>
I think that Max's original intention was about POSIX_FADV_DONTNEED:
one should call this fadvise for each fully written cluster. Though this
is a bit buggy, and from a performance point of view it will be slower.

This call could be slow, and thus it should be delegated to a coroutine.
We have thought about this, but the amount of work to implement it, even
if it seems lower, is not really lower, and the result would not be as
great.
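
A minimal sketch of that per-cluster fadvise call (the function name and
parameters are made up here; this is not code from the series):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>

/* Drop the page cache for one host cluster that has just been fully
 * written. posix_fadvise() may block, so in QEMU this would have to be
 * delegated to a coroutine / the thread pool, as said above. */
static int drop_cluster_cache(int fd, int64_t cluster_offset,
                              int64_t cluster_size)
{
    /* returns 0 on success or an errno value on failure */
    return posix_fadvise(fd, cluster_offset, cluster_size,
                         POSIX_FADV_DONTNEED);
}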

Den


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-02-09 18:41           ` Denis V. Lunev
@ 2021-02-09 18:51             ` Vladimir Sementsov-Ogievskiy
  2021-02-10 10:00               ` Max Reitz
  0 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-09 18:51 UTC (permalink / raw)
  To: Denis V. Lunev, Max Reitz, qemu-block; +Cc: qemu-devel, eblake, armbru, kwolf

09.02.2021 21:41, Denis V. Lunev wrote:
> On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.02.2021 17:47, Max Reitz wrote:
>>>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 09.02.2021 16:25, Max Reitz wrote:
>>>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> Hi all!
>>>>>>>
>>>>>>> I know, I have several series waiting for a resend, but I had to
>>>>>>> switch
>>>>>>> to another task spawned from our customer's bug.
>>>>>>>
>>>>>>> Original problem: we use O_DIRECT for all vm images in our
>>>>>>> product, it's
>>>>>>> the policy. The only exclusion is backup target qcow2 image for
>>>>>>> compressed backup, because compressed backup is extremely slow with
>>>>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>>>>> produces a lot of pagecache.
>>>>>>>
>>>>>>> So we can either implement some internal cache or use fadvise
>>>>>>> somehow.
>>>>>>> Backup has several async workes, which writes simultaneously, so
>>>>>>> in both
>>>>>>> ways we have to track host cluster filling (before dropping the
>>>>>>> cache
>>>>>>> corresponding to the cluster).  So, if we have to track anyway,
>>>>>>> let's
>>>>>>> try to implement the cache.
>>>>>>
>>>>>> I wanted to be excited here, because that sounds like it would be
>>>>>> very easy to implement caching.  Like, just keep the cluster at
>>>>>> free_byte_offset cached until the cluster it points to changes,
>>>>>> then flush the cluster.
>>>>>
>>>>> The problem is that chunks are written asynchronously.. That's why
>>>>> this all is not so easy.
>>>>>
>>>>>>
>>>>>> But then I see like 900 new lines of code, and I’m much less
>>>>>> excited...
>>>>>>
>>>>>>> Idea is simple: cache small unaligned write and flush the cluster
>>>>>>> when
>>>>>>> filled.
>>>>>>>
>>>>>>> Performance result is very good (results in a table is time of
>>>>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>>>>
>>>>>> “Filled with ones” really is an edge case, though.
>>>>>
>>>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>>>
>>>>>>
>>>>>>> ---------------  -----------  -----------
>>>>>>>                    backup(old)  backup(new)
>>>>>>> ssd:hdd(direct)  3e+02        4.4
>>>>>>>                                   -99%
>>>>>>> ssd:hdd(cached)  5.7          5.4
>>>>>>>                                   -5%
>>>>>>> ---------------  -----------  -----------
>>>>>>>
>>>>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>>>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>>>>>> cache by default (which is done by the series).
>>>>>>
>>>>>> First, I’m not sure how O_DIRECT really is relevant, because I
>>>>>> don’t really see the point for writing compressed images.
>>>>>
>>>>> compressed backup is a point
>>>>
>>>> (Perhaps irrelevant, but just to be clear:) I meant the point of
>>>> using O_DIRECT, which one can decide to not use for backup targets
>>>> (as you have done already).
>>>>
>>>>>> Second, I find it a bit cheating if you say there is a huge
>>>>>> improvement for the no-cache case, when actually, well, you just
>>>>>> added a cache.  So the no-cache case just became faster because
>>>>>> there is a cache now.
>>>>>
>>>>> Still, performance comparison is relevant to show that O_DIRECT as
>>>>> is unusable for compressed backup.
>>>>
>>>> (Again, perhaps irrelevant, but:) Yes, but my first point was
>>>> exactly whether O_DIRECT is even relevant for writing compressed
>>>> images.
>>>>
>>>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>>>>> sense for compressed images, qemu’s format drivers are free to
>>>>>> introduce some caching (because technically the cache.direct
>>>>>> option only applies to the protocol driver) for collecting
>>>>>> compressed writes.
>>>>>
>>>>> Yes I thought in this way, enabling the cache by default.
>>>>>
>>>>>> That conclusion makes both of my complaints kind of moot.
>>>>>>
>>>>>> *shrug*
>>>>>>
>>>>>> Third, what is the real-world impact on the page cache?  You
>>>>>> described that that’s the reason why you need the cache in qemu,
>>>>>> because otherwise the page cache is polluted too much.  How much
>>>>>> is the difference really?  (I don’t know how good the compression
>>>>>> ratio is for real-world images.)
>>>>>
>>>>> Hm. I don't know the ratio.. Customer reported that most of RAM is
>>>>> polluted by Qemu's cache, and we use O_DIRECT for everything except
>>>>> for target of compressed backup.. Still the pollution may relate to
>>>>> several backups and of course it is simple enough to drop the cache
>>>>> after each backup. But I think that even one backup of 16T disk may
>>>>> pollute RAM enough.
>>>>
>>>> Oh, sorry, I just realized I had a brain fart there.  I was
>>>> referring to whether this series improves the page cache pollution.
>>>> But obviously it will if it allows you to re-enable O_DIRECT.
>>>>
>>>>>> Related to that, I remember a long time ago we had some discussion
>>>>>> about letting qemu-img convert set a special cache mode for the
>>>>>> target image that would make Linux drop everything before the last
>>>>>> offset written (i.e., I suppose fadvise() with
>>>>>> POSIX_FADV_SEQUENTIAL).  You discard that idea based on the fact
>>>>>> that implementing a cache in qemu would be simple, but it isn’t,
>>>>>> really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One
>>>>>> advantage of using that would be that we could reuse it for
>>>>>> non-compressed images that are written by backup or qemu-img
>>>>>> convert.)
>>>>>
>>>>> The problem is that writes are async. And therefore, not sequential.
>>>>
>>>> In theory, yes, but all compressed writes still goes through
>>>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>>>> whether in practice the writes aren’t usually sufficiently
>>>> sequential to make POSIX_FADV_SEQUENTIAL work fine.
>>>
>>> Yes, allocation is sequential. But writes are not.. Reasonable, I
>>> should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for
>>> the whole backup target before the backup? Will try. Still, I expect
>>> that my cache will show better performance anyway. Actually,
>>> comparing cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e.
>>> 20% faster, which is significant (still, yes, would be good to check
>>> it on more real case than all-ones).
>>>
>>>>
>>>>> So
>>>>> I have to track the writes and wait until the whole cluster is
>>>>> filled. It's simple use fadvise as an option to my cache: instead
>>>>> of caching data and write when cluster is filled we can instead
>>>>> mark cluster POSIX_FADV_DONTNEED.
>>>>>
>>>>>>
>>>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>>>
>>>>>>
>>>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>>>> write cache?  I.e., on read, everything is flushed, so we don’t
>>>>>> have to deal with that.  I don’t think there are many valid cases
>>>>>> where a compressed image is both written to and read from at the
>>>>>> same time. (Just asking, because I’d really want this code to be
>>>>>> simpler.  I can imagine that reading from the cache is the least
>>>>>> bit of complexity, but perhaps...)
>>>>>>
>>>>>
>>>>> Hm. I really didn't want to support reads, and do it only to make
>>>>> it possible to enable the cache by default.. Still read function is
>>>>> really simple, and I don't think that dropping it will simplify the
>>>>> code significantly.
>>>>
>>>> That’s too bad.
>>>>
>>>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>>>> actually would do in practice.
>>>
>>> will check.
>>>
>>
>> Checked that if I mark the whole file by FADV_SEQUENTIAL, cache is not
>> removed.
>>
>> Test:
>> [root@kvm fadvise]# cat a.c
>> #define _GNU_SOURCE
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <stdio.h>
>> #include <getopt.h>
>> #include <string.h>
>> #include <stdbool.h>
>>
>> int main(int argc, char *argv[])
>> {
>>      int fd;
>>      int i;
>>      char mb[1024 * 1024];
>>      int open_flags = O_RDWR | O_CREAT | O_EXCL;
>>      int fadv_flags = 0;
>>      int fadv_final_flags = 0;
>>      int len = 1024 * 1024;
>>      bool do_fsync = false;
>>
>>      for (i = 1; i < argc - 1; i++) {
>>          const char *arg = argv[i];
>>
>>          if (!strcmp(arg, "direct")) {
>>              open_flags |= O_DIRECT;
>>          } else if (!strcmp(arg, "seq")) {
>>              fadv_flags = POSIX_FADV_SEQUENTIAL;
>>          } else if (!strcmp(arg, "dontneed")) {
>>              fadv_flags = POSIX_FADV_DONTNEED;
>>          } else if (!strcmp(arg, "final-dontneed")) {
>>              fadv_final_flags = POSIX_FADV_DONTNEED;
>>          } else if (!strcmp(arg, "fsync")) {
>>              do_fsync = true;
>>          } else {
>>              fprintf(stderr, "unknown: %s\n", arg);
>>              return 1;
>>          }
>>      }
>>
>>      fd = open(argv[argc - 1], open_flags);
>>
>>      if (fd < 0) {
>>          fprintf(stderr, "failed to open\n");
>>          return 1;
>>      }
>>
>>      if (fadv_flags) {
>>          posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
>>      }
>>      for (i = 0; i < 100; i++) {
>>          write(fd, mb, len);
>>      }
>>      if (fadv_final_flags) {
>>          posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
>>      }
>>      if (do_fsync) {
>>          fsync(fd);
>>      }
>>
>>      close(fd);
>> }
>>
>>
>>
>> [root@kvm fadvise]# gcc a.c
>> [root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
>>    RES PAGES  SIZE FILE
>>   100M 25600  100M x
>> [root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
>>    RES PAGES  SIZE FILE
>>   100M 25600  100M x
>> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
>>    RES PAGES  SIZE FILE
>>    36M  9216  100M x
>> [root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
>>    RES PAGES  SIZE FILE
>>   100M 25600  100M x
>> [root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
>>    RES PAGES  SIZE FILE
>>   100M 25600  100M x
>> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
>>    RES PAGES  SIZE FILE
>>    36M  9216  100M x
>> [root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
>> RES PAGES SIZE FILE
>>   0B     0   0B x
>>
>>
>>
>> Backup-generated pagecache is a formal trash, it will be never used.
>> And it's bad that it can displace another good pagecache. So the best
>> thing for backup is direct mode + proposed cache.
>>
>>
> I think that the original intention of Max is about POSIX_FADV_DONTNEED
> One should call this fadvise for just fully written cluster.

This should work, but:

  - as we see from the test above, POSIX_FADV_DONTNEED doesn't remove the whole cache (see final-dontneed)
  - as I said, we'll still have to track writes, so the cache machinery will be the same; just instead of the postponed-write operation we'll do fadvise (see the sketch below)

Hmm. Still, in this way we would not need some of the difficult parts of my proposed cache.

So, it may be reasonable to at least split the big patch so that

  - the first part brings write / full-cluster tracking and fadvise

  - the second part adds a caching-writes option, the corresponding flush code and the additional performance

Does that make sense?
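
Roughly, the common tracking part could look like this (a sketch with
made-up names, not the actual patch code); the only difference between
the two variants is what happens once a cluster is reported full:

#include <stdint.h>
#include <stdbool.h>

/* One entry per host cluster currently being filled by the asynchronous
 * compressed writes. */
typedef struct ClusterTrack {
    int64_t offset;         /* host cluster offset in the image file */
    int64_t bytes_written;  /* how much of the cluster is written so far */
} ClusterTrack;

/* Account one completed write; returns true once the cluster is full, at
 * which point we either flush the cached data (cache variant) or
 * posix_fadvise(POSIX_FADV_DONTNEED) the range (fadvise variant). */
static bool cluster_account_write(ClusterTrack *c, int64_t bytes,
                                  int64_t cluster_size)
{
    c->bytes_written += bytes;
    return c->bytes_written >= cluster_size;
}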


> Though this is
> a bit buggy and from performance point of view will be slower.
> 
> This call could be slow and thus it should be created as delegate to
> co-routine. We have though on this, but the amount of work to
> implement even if seems lower, is not really lower and the result
> would not be as great.
> 
> Den
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-02-09 18:51             ` Vladimir Sementsov-Ogievskiy
@ 2021-02-10 10:00               ` Max Reitz
  2021-02-10 10:10                 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2021-02-10 10:00 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Denis V. Lunev, qemu-block
  Cc: kwolf, qemu-devel, armbru

On 09.02.21 19:51, Vladimir Sementsov-Ogievskiy wrote:
> 09.02.2021 21:41, Denis V. Lunev wrote:
>> On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
>>>> 09.02.2021 17:47, Max Reitz wrote:
>>>>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 09.02.2021 16:25, Max Reitz wrote:
>>>>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>> Hi all!
>>>>>>>>
>>>>>>>> I know, I have several series waiting for a resend, but I had to
>>>>>>>> switch
>>>>>>>> to another task spawned from our customer's bug.
>>>>>>>>
>>>>>>>> Original problem: we use O_DIRECT for all vm images in our
>>>>>>>> product, it's
>>>>>>>> the policy. The only exclusion is backup target qcow2 image for
>>>>>>>> compressed backup, because compressed backup is extremely slow with
>>>>>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>>>>>> produces a lot of pagecache.
>>>>>>>>
>>>>>>>> So we can either implement some internal cache or use fadvise
>>>>>>>> somehow.
>>>>>>>> Backup has several async workes, which writes simultaneously, so
>>>>>>>> in both
>>>>>>>> ways we have to track host cluster filling (before dropping the
>>>>>>>> cache
>>>>>>>> corresponding to the cluster).  So, if we have to track anyway,
>>>>>>>> let's
>>>>>>>> try to implement the cache.
>>>>>>>
>>>>>>> I wanted to be excited here, because that sounds like it would be
>>>>>>> very easy to implement caching.  Like, just keep the cluster at
>>>>>>> free_byte_offset cached until the cluster it points to changes,
>>>>>>> then flush the cluster.
>>>>>>
>>>>>> The problem is that chunks are written asynchronously.. That's why
>>>>>> this all is not so easy.
>>>>>>
>>>>>>>
>>>>>>> But then I see like 900 new lines of code, and I’m much less
>>>>>>> excited...
>>>>>>>
>>>>>>>> Idea is simple: cache small unaligned write and flush the cluster
>>>>>>>> when
>>>>>>>> filled.
>>>>>>>>
>>>>>>>> Performance result is very good (results in a table is time of
>>>>>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>>>>>
>>>>>>> “Filled with ones” really is an edge case, though.
>>>>>>
>>>>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>>>>
>>>>>>>
>>>>>>>> ---------------  -----------  -----------
>>>>>>>>                    backup(old)  backup(new)
>>>>>>>> ssd:hdd(direct)  3e+02        4.4
>>>>>>>>                                   -99%
>>>>>>>> ssd:hdd(cached)  5.7          5.4
>>>>>>>>                                   -5%
>>>>>>>> ---------------  -----------  -----------
>>>>>>>>
>>>>>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>>>>>> O_DIRECT with new implemented cache. So, I suggest to enable the 
>>>>>>>> new
>>>>>>>> cache by default (which is done by the series).
>>>>>>>
>>>>>>> First, I’m not sure how O_DIRECT really is relevant, because I
>>>>>>> don’t really see the point for writing compressed images.
>>>>>>
>>>>>> compressed backup is a point
>>>>>
>>>>> (Perhaps irrelevant, but just to be clear:) I meant the point of
>>>>> using O_DIRECT, which one can decide to not use for backup targets
>>>>> (as you have done already).
>>>>>
>>>>>>> Second, I find it a bit cheating if you say there is a huge
>>>>>>> improvement for the no-cache case, when actually, well, you just
>>>>>>> added a cache.  So the no-cache case just became faster because
>>>>>>> there is a cache now.
>>>>>>
>>>>>> Still, performance comparison is relevant to show that O_DIRECT as
>>>>>> is unusable for compressed backup.
>>>>>
>>>>> (Again, perhaps irrelevant, but:) Yes, but my first point was
>>>>> exactly whether O_DIRECT is even relevant for writing compressed
>>>>> images.
>>>>>
>>>>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>>>>>> sense for compressed images, qemu’s format drivers are free to
>>>>>>> introduce some caching (because technically the cache.direct
>>>>>>> option only applies to the protocol driver) for collecting
>>>>>>> compressed writes.
>>>>>>
>>>>>> Yes I thought in this way, enabling the cache by default.
>>>>>>
>>>>>>> That conclusion makes both of my complaints kind of moot.
>>>>>>>
>>>>>>> *shrug*
>>>>>>>
>>>>>>> Third, what is the real-world impact on the page cache?  You
>>>>>>> described that that’s the reason why you need the cache in qemu,
>>>>>>> because otherwise the page cache is polluted too much.  How much
>>>>>>> is the difference really?  (I don’t know how good the compression
>>>>>>> ratio is for real-world images.)
>>>>>>
>>>>>> Hm. I don't know the ratio.. Customer reported that most of RAM is
>>>>>> polluted by Qemu's cache, and we use O_DIRECT for everything except
>>>>>> for target of compressed backup.. Still the pollution may relate to
>>>>>> several backups and of course it is simple enough to drop the cache
>>>>>> after each backup. But I think that even one backup of 16T disk may
>>>>>> pollute RAM enough.
>>>>>
>>>>> Oh, sorry, I just realized I had a brain fart there.  I was
>>>>> referring to whether this series improves the page cache pollution.
>>>>> But obviously it will if it allows you to re-enable O_DIRECT.
>>>>>
>>>>>>> Related to that, I remember a long time ago we had some discussion
>>>>>>> about letting qemu-img convert set a special cache mode for the
>>>>>>> target image that would make Linux drop everything before the last
>>>>>>> offset written (i.e., I suppose fadvise() with
>>>>>>> POSIX_FADV_SEQUENTIAL).  You discard that idea based on the fact
>>>>>>> that implementing a cache in qemu would be simple, but it isn’t,
>>>>>>> really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One
>>>>>>> advantage of using that would be that we could reuse it for
>>>>>>> non-compressed images that are written by backup or qemu-img
>>>>>>> convert.)
>>>>>>
>>>>>> The problem is that writes are async. And therefore, not sequential.
>>>>>
>>>>> In theory, yes, but all compressed writes still goes through
>>>>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>>>>> whether in practice the writes aren’t usually sufficiently
>>>>> sequential to make POSIX_FADV_SEQUENTIAL work fine.
>>>>
>>>> Yes, allocation is sequential. But writes are not.. Reasonable, I
>>>> should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for
>>>> the whole backup target before the backup? Will try. Still, I expect
>>>> that my cache will show better performance anyway. Actually,
>>>> comparing cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e.
>>>> 20% faster, which is significant (still, yes, would be good to check
>>>> it on more real case than all-ones).
>>>>
>>>>>
>>>>>> So
>>>>>> I have to track the writes and wait until the whole cluster is
>>>>>> filled. It's simple use fadvise as an option to my cache: instead
>>>>>> of caching data and write when cluster is filled we can instead
>>>>>> mark cluster POSIX_FADV_DONTNEED.
>>>>>>
>>>>>>>
>>>>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>>>>
>>>>>>>
>>>>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>>>>> write cache?  I.e., on read, everything is flushed, so we don’t
>>>>>>> have to deal with that.  I don’t think there are many valid cases
>>>>>>> where a compressed image is both written to and read from at the
>>>>>>> same time. (Just asking, because I’d really want this code to be
>>>>>>> simpler.  I can imagine that reading from the cache is the least
>>>>>>> bit of complexity, but perhaps...)
>>>>>>>
>>>>>>
>>>>>> Hm. I really didn't want to support reads, and do it only to make
>>>>>> it possible to enable the cache by default.. Still read function is
>>>>>> really simple, and I don't think that dropping it will simplify the
>>>>>> code significantly.
>>>>>
>>>>> That’s too bad.
>>>>>
>>>>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>>>>> actually would do in practice.
>>>>
>>>> will check.
>>>>
>>>
>>> Checked that if I mark the whole file by FADV_SEQUENTIAL, cache is not
>>> removed.
>>>
>>> Test:
>>> [root@kvm fadvise]# cat a.c
>>> #define _GNU_SOURCE
>>> #include <fcntl.h>
>>> #include <unistd.h>
>>> #include <stdio.h>
>>> #include <getopt.h>
>>> #include <string.h>
>>> #include <stdbool.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>      int fd;
>>>      int i;
>>>      char mb[1024 * 1024];
>>>      int open_flags = O_RDWR | O_CREAT | O_EXCL;
>>>      int fadv_flags = 0;
>>>      int fadv_final_flags = 0;
>>>      int len = 1024 * 1024;
>>>      bool do_fsync = false;
>>>
>>>      for (i = 1; i < argc - 1; i++) {
>>>          const char *arg = argv[i];
>>>
>>>          if (!strcmp(arg, "direct")) {
>>>              open_flags |= O_DIRECT;
>>>          } else if (!strcmp(arg, "seq")) {
>>>              fadv_flags = POSIX_FADV_SEQUENTIAL;
>>>          } else if (!strcmp(arg, "dontneed")) {
>>>              fadv_flags = POSIX_FADV_DONTNEED;
>>>          } else if (!strcmp(arg, "final-dontneed")) {
>>>              fadv_final_flags = POSIX_FADV_DONTNEED;
>>>          } else if (!strcmp(arg, "fsync")) {
>>>              do_fsync = true;
>>>          } else {
>>>              fprintf(stderr, "unknown: %s\n", arg);
>>>              return 1;
>>>          }
>>>      }
>>>
>>>      fd = open(argv[argc - 1], open_flags);
>>>
>>>      if (fd < 0) {
>>>          fprintf(stderr, "failed to open\n");
>>>          return 1;
>>>      }
>>>
>>>      if (fadv_flags) {
>>>          posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
>>>      }
>>>      for (i = 0; i < 100; i++) {
>>>          write(fd, mb, len);
>>>      }
>>>      if (fadv_final_flags) {
>>>          posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
>>>      }
>>>      if (do_fsync) {
>>>          fsync(fd);
>>>      }
>>>
>>>      close(fd);
>>> }
>>>
>>>
>>>
>>> [root@kvm fadvise]# gcc a.c
>>> [root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
>>>    RES PAGES  SIZE FILE
>>>   100M 25600  100M x
>>> [root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
>>>    RES PAGES  SIZE FILE
>>>   100M 25600  100M x
>>> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
>>>    RES PAGES  SIZE FILE
>>>    36M  9216  100M x
>>> [root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
>>>    RES PAGES  SIZE FILE
>>>   100M 25600  100M x
>>> [root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
>>>    RES PAGES  SIZE FILE
>>>   100M 25600  100M x
>>> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
>>>    RES PAGES  SIZE FILE
>>>    36M  9216  100M x
>>> [root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
>>> RES PAGES SIZE FILE
>>>   0B     0   0B x
>>>
>>>
>>>
>>> Backup-generated pagecache is a formal trash, it will be never used.
>>> And it's bad that it can displace another good pagecache. So the best
>>> thing for backup is direct mode + proposed cache.

What a shame.

Thanks a lot for testing.

>> I think that the original intention of Max is about POSIX_FADV_DONTNEED
>> One should call this fadvise for just fully written cluster.

I had hoped that SEQUENTIAL would just work, magically.

> This should work, but:
> 
>   - as we see from test above, POSIX_FADV_DONTNEED don't remove the 
> whole cache (see final-dontneed)
>   - as I said we'll have to track writes, so the cache will be the same, 
> just instead of postponed-write operation we'll do fadvise.
> 
> Hmm. Still, in this way, we will not need some difficult things in my 
> proposed cache.
> 
> So, it may be reasonable to at least split the big patch so that
> 
>   - first part brings writes / full-cluster tracking and fadvice
> 
>   - second part adds caching-wrties option, corresponding flush code and 
> additional performance
> 
> Does it make sense?

I think the fadvise solution would have been nice if it gave us 
something magical that we could also use for normal qemu-img convert (or 
backup) operations, but as that doesn’t seem to be the case, I don’t 
think it makes too much sense to implement something like that.  (I 
imagine doing fadvise would also create the need to implement a new 
block-layer interface to call into file-posix and so on.)

I’d propose I take some time to look at your patch as-is and then I 
report back.

Max



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-02-09 16:52       ` Denis V. Lunev
@ 2021-02-10 10:00         ` Max Reitz
  0 siblings, 0 replies; 32+ messages in thread
From: Max Reitz @ 2021-02-10 10:00 UTC (permalink / raw)
  To: Denis V. Lunev, Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, qemu-devel, armbru

On 09.02.21 17:52, Denis V. Lunev wrote:
> On 2/9/21 5:47 PM, Max Reitz wrote:
>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.02.2021 16:25, Max Reitz wrote:
>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Hi all!
>>>>>
>>>>> I know, I have several series waiting for a resend, but I had to
>>>>> switch
>>>>> to another task spawned from our customer's bug.
>>>>>
>>>>> Original problem: we use O_DIRECT for all vm images in our product,
>>>>> it's
>>>>> the policy. The only exclusion is backup target qcow2 image for
>>>>> compressed backup, because compressed backup is extremely slow with
>>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>>> produces a lot of pagecache.
>>>>>
>>>>> So we can either implement some internal cache or use fadvise somehow.
>>>>> Backup has several async workes, which writes simultaneously, so in
>>>>> both
>>>>> ways we have to track host cluster filling (before dropping the cache
>>>>> corresponding to the cluster).  So, if we have to track anyway, let's
>>>>> try to implement the cache.
>>>>
>>>> I wanted to be excited here, because that sounds like it would be
>>>> very easy to implement caching.  Like, just keep the cluster at
>>>> free_byte_offset cached until the cluster it points to changes, then
>>>> flush the cluster.
>>>
>>> The problem is that chunks are written asynchronously.. That's why
>>> this all is not so easy.
>>>
>>>>
>>>> But then I see like 900 new lines of code, and I’m much less excited...
>>>>
>>>>> Idea is simple: cache small unaligned write and flush the cluster when
>>>>> filled.
>>>>>
>>>>> Performance result is very good (results in a table is time of
>>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>>
>>>> “Filled with ones” really is an edge case, though.
>>>
>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>
>>>>
>>>>> ---------------  -----------  -----------
>>>>>                    backup(old)  backup(new)
>>>>> ssd:hdd(direct)  3e+02        4.4
>>>>>                                   -99%
>>>>> ssd:hdd(cached)  5.7          5.4
>>>>>                                   -5%
>>>>> ---------------  -----------  -----------
>>>>>
>>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>>>> cache by default (which is done by the series).
>>>>
>>>> First, I’m not sure how O_DIRECT really is relevant, because I don’t
>>>> really see the point for writing compressed images.
>>>
>>> compressed backup is a point
>>
>> (Perhaps irrelevant, but just to be clear:) I meant the point of using
>> O_DIRECT, which one can decide to not use for backup targets (as you
>> have done already).
>>
>>>> Second, I find it a bit cheating if you say there is a huge
>>>> improvement for the no-cache case, when actually, well, you just
>>>> added a cache.  So the no-cache case just became faster because
>>>> there is a cache now.
>>>
>>> Still, performance comparison is relevant to show that O_DIRECT as is
>>> unusable for compressed backup.
>>
>> (Again, perhaps irrelevant, but:) Yes, but my first point was exactly
>> whether O_DIRECT is even relevant for writing compressed images.
>>
>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>>> sense for compressed images, qemu’s format drivers are free to
>>>> introduce some caching (because technically the cache.direct option
>>>> only applies to the protocol driver) for collecting compressed writes.
>>>
>>> Yes I thought in this way, enabling the cache by default.
>>>
>>>> That conclusion makes both of my complaints kind of moot.
>>>>
>>>> *shrug*
>>>>
>>>> Third, what is the real-world impact on the page cache?  You
>>>> described that that’s the reason why you need the cache in qemu,
>>>> because otherwise the page cache is polluted too much.  How much is
>>>> the difference really?  (I don’t know how good the compression ratio
>>>> is for real-world images.)
>>>
>>> Hm. I don't know the ratio.. Customer reported that most of RAM is
>>> polluted by Qemu's cache, and we use O_DIRECT for everything except
>>> for target of compressed backup.. Still the pollution may relate to
>>> several backups and of course it is simple enough to drop the cache
>>> after each backup. But I think that even one backup of 16T disk may
>>> pollute RAM enough.
>>
>> Oh, sorry, I just realized I had a brain fart there.  I was referring
>> to whether this series improves the page cache pollution.  But
>> obviously it will if it allows you to re-enable O_DIRECT.
>>
>>>> Related to that, I remember a long time ago we had some discussion
>>>> about letting qemu-img convert set a special cache mode for the
>>>> target image that would make Linux drop everything before the last
>>>> offset written (i.e., I suppose fadvise() with
>>>> POSIX_FADV_SEQUENTIAL).  You discard that idea based on the fact
>>>> that implementing a cache in qemu would be simple, but it isn’t,
>>>> really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One
>>>> advantage of using that would be that we could reuse it for
>>>> non-compressed images that are written by backup or qemu-img convert.)
>>>
>>> The problem is that writes are async. And therefore, not sequential.
>>
>> In theory, yes, but all compressed writes still goes through
>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>> whether in practice the writes aren’t usually sufficiently sequential
>> to make POSIX_FADV_SEQUENTIAL work fine.
>>
>>> So
>>> I have to track the writes and wait until the whole cluster is
>>> filled. It's simple use fadvise as an option to my cache: instead of
>>> caching data and write when cluster is filled we can instead mark
>>> cluster POSIX_FADV_DONTNEED.
>>>
>>>>
>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>
>>>>
>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>> write cache?  I.e., on read, everything is flushed, so we don’t have
>>>> to deal with that.  I don’t think there are many valid cases where a
>>>> compressed image is both written to and read from at the same time.
>>>> (Just asking, because I’d really want this code to be simpler.  I
>>>> can imagine that reading from the cache is the least bit of
>>>> complexity, but perhaps...)
>>>>
>>>
>>> Hm. I really didn't want to support reads, and do it only to make it
>>> possible to enable the cache by default.. Still read function is
>>> really simple, and I don't think that dropping it will simplify the
>>> code significantly.
>>
>> That’s too bad.
>>
>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>> actually would do in practice.
>>
>> (But even then, the premise just doesn’t motivate me sufficiently yet...)
>>
> POSIX_FADV_SEQUENTIAL influences the amount of the read-ahead
> only.
> 
> int generic_fadvise(struct file *file, loff_t offset, loff_t len, int
> advice)
> {
> .....
>      case POSIX_FADV_NORMAL:
>          file->f_ra.ra_pages = bdi->ra_pages;
>          spin_lock(&file->f_lock);
>          file->f_mode &= ~FMODE_RANDOM;
>          spin_unlock(&file->f_lock);
>          break;
>      case POSIX_FADV_SEQUENTIAL:
>          file->f_ra.ra_pages = bdi->ra_pages * 2;
>          spin_lock(&file->f_lock);
>          file->f_mode &= ~FMODE_RANDOM;
>          spin_unlock(&file->f_lock);
>          break;
> 
> thus the only difference from the usual NORMAL mode would be
> doubled amount of read-ahead pages.

:/

Thanks for looking it up.

Max



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-02-10 10:00               ` Max Reitz
@ 2021-02-10 10:10                 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-10 10:10 UTC (permalink / raw)
  To: Max Reitz, Denis V. Lunev, qemu-block; +Cc: qemu-devel, eblake, armbru, kwolf

10.02.2021 13:00, Max Reitz wrote:
> On 09.02.21 19:51, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 21:41, Denis V. Lunev wrote:
>>> On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:
>>>> 09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 09.02.2021 17:47, Max Reitz wrote:
>>>>>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 09.02.2021 16:25, Max Reitz wrote:
>>>>>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>>> Hi all!
>>>>>>>>>
>>>>>>>>> I know, I have several series waiting for a resend, but I had to
>>>>>>>>> switch
>>>>>>>>> to another task spawned from our customer's bug.
>>>>>>>>>
>>>>>>>>> Original problem: we use O_DIRECT for all vm images in our
>>>>>>>>> product, it's
>>>>>>>>> the policy. The only exclusion is backup target qcow2 image for
>>>>>>>>> compressed backup, because compressed backup is extremely slow with
>>>>>>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>>>>>>> produces a lot of pagecache.
>>>>>>>>>
>>>>>>>>> So we can either implement some internal cache or use fadvise
>>>>>>>>> somehow.
>>>>>>>>> Backup has several async workes, which writes simultaneously, so
>>>>>>>>> in both
>>>>>>>>> ways we have to track host cluster filling (before dropping the
>>>>>>>>> cache
>>>>>>>>> corresponding to the cluster).  So, if we have to track anyway,
>>>>>>>>> let's
>>>>>>>>> try to implement the cache.
>>>>>>>>
>>>>>>>> I wanted to be excited here, because that sounds like it would be
>>>>>>>> very easy to implement caching.  Like, just keep the cluster at
>>>>>>>> free_byte_offset cached until the cluster it points to changes,
>>>>>>>> then flush the cluster.
>>>>>>>
>>>>>>> The problem is that chunks are written asynchronously.. That's why
>>>>>>> this all is not so easy.
>>>>>>>
>>>>>>>>
>>>>>>>> But then I see like 900 new lines of code, and I’m much less
>>>>>>>> excited...
>>>>>>>>
>>>>>>>>> Idea is simple: cache small unaligned write and flush the cluster
>>>>>>>>> when
>>>>>>>>> filled.
>>>>>>>>>
>>>>>>>>> Performance result is very good (results in a table is time of
>>>>>>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>>>>>>
>>>>>>>> “Filled with ones” really is an edge case, though.
>>>>>>>
>>>>>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>>>>>
>>>>>>>>
>>>>>>>>> ---------------  -----------  -----------
>>>>>>>>>                    backup(old)  backup(new)
>>>>>>>>> ssd:hdd(direct)  3e+02        4.4
>>>>>>>>>                                   -99%
>>>>>>>>> ssd:hdd(cached)  5.7          5.4
>>>>>>>>>                                   -5%
>>>>>>>>> ---------------  -----------  -----------
>>>>>>>>>
>>>>>>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>>>>>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>>>>>>>> cache by default (which is done by the series).
>>>>>>>>
>>>>>>>> First, I’m not sure how O_DIRECT really is relevant, because I
>>>>>>>> don’t really see the point for writing compressed images.
>>>>>>>
>>>>>>> compressed backup is a point
>>>>>>
>>>>>> (Perhaps irrelevant, but just to be clear:) I meant the point of
>>>>>> using O_DIRECT, which one can decide to not use for backup targets
>>>>>> (as you have done already).
>>>>>>
>>>>>>>> Second, I find it a bit cheating if you say there is a huge
>>>>>>>> improvement for the no-cache case, when actually, well, you just
>>>>>>>> added a cache.  So the no-cache case just became faster because
>>>>>>>> there is a cache now.
>>>>>>>
>>>>>>> Still, performance comparison is relevant to show that O_DIRECT as
>>>>>>> is unusable for compressed backup.
>>>>>>
>>>>>> (Again, perhaps irrelevant, but:) Yes, but my first point was
>>>>>> exactly whether O_DIRECT is even relevant for writing compressed
>>>>>> images.
>>>>>>
>>>>>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>>>>>>> sense for compressed images, qemu’s format drivers are free to
>>>>>>>> introduce some caching (because technically the cache.direct
>>>>>>>> option only applies to the protocol driver) for collecting
>>>>>>>> compressed writes.
>>>>>>>
>>>>>>> Yes I thought in this way, enabling the cache by default.
>>>>>>>
>>>>>>>> That conclusion makes both of my complaints kind of moot.
>>>>>>>>
>>>>>>>> *shrug*
>>>>>>>>
>>>>>>>> Third, what is the real-world impact on the page cache?  You
>>>>>>>> described that that’s the reason why you need the cache in qemu,
>>>>>>>> because otherwise the page cache is polluted too much.  How much
>>>>>>>> is the difference really?  (I don’t know how good the compression
>>>>>>>> ratio is for real-world images.)
>>>>>>>
>>>>>>> Hm. I don't know the ratio.. Customer reported that most of RAM is
>>>>>>> polluted by Qemu's cache, and we use O_DIRECT for everything except
>>>>>>> for target of compressed backup.. Still the pollution may relate to
>>>>>>> several backups and of course it is simple enough to drop the cache
>>>>>>> after each backup. But I think that even one backup of 16T disk may
>>>>>>> pollute RAM enough.
>>>>>>
>>>>>> Oh, sorry, I just realized I had a brain fart there.  I was
>>>>>> referring to whether this series improves the page cache pollution.
>>>>>> But obviously it will if it allows you to re-enable O_DIRECT.
>>>>>>
>>>>>>>> Related to that, I remember a long time ago we had some discussion
>>>>>>>> about letting qemu-img convert set a special cache mode for the
>>>>>>>> target image that would make Linux drop everything before the last
>>>>>>>> offset written (i.e., I suppose fadvise() with
>>>>>>>> POSIX_FADV_SEQUENTIAL).  You discard that idea based on the fact
>>>>>>>> that implementing a cache in qemu would be simple, but it isn’t,
>>>>>>>> really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One
>>>>>>>> advantage of using that would be that we could reuse it for
>>>>>>>> non-compressed images that are written by backup or qemu-img
>>>>>>>> convert.)
>>>>>>>
>>>>>>> The problem is that writes are async. And therefore, not sequential.
>>>>>>
>>>>>> In theory, yes, but all compressed writes still goes through
>>>>>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>>>>>> whether in practice the writes aren’t usually sufficiently
>>>>>> sequential to make POSIX_FADV_SEQUENTIAL work fine.
>>>>>
>>>>> Yes, allocation is sequential. But writes are not.. Reasonable, I
>>>>> should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for
>>>>> the whole backup target before the backup? Will try. Still, I expect
>>>>> that my cache will show better performance anyway. Actually,
>>>>> comparing cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e.
>>>>> 20% faster, which is significant (still, yes, would be good to check
>>>>> it on more real case than all-ones).
>>>>>
>>>>>>
>>>>>>> So
>>>>>>> I have to track the writes and wait until the whole cluster is
>>>>>>> filled. It's simple use fadvise as an option to my cache: instead
>>>>>>> of caching data and write when cluster is filled we can instead
>>>>>>> mark cluster POSIX_FADV_DONTNEED.
>>>>>>>
>>>>>>>>
>>>>>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>>>>>
>>>>>>>>
>>>>>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>>>>>> write cache?  I.e., on read, everything is flushed, so we don’t
>>>>>>>> have to deal with that.  I don’t think there are many valid cases
>>>>>>>> where a compressed image is both written to and read from at the
>>>>>>>> same time. (Just asking, because I’d really want this code to be
>>>>>>>> simpler.  I can imagine that reading from the cache is the least
>>>>>>>> bit of complexity, but perhaps...)
>>>>>>>>
>>>>>>>
>>>>>>> Hm. I really didn't want to support reads, and do it only to make
>>>>>>> it possible to enable the cache by default.. Still read function is
>>>>>>> really simple, and I don't think that dropping it will simplify the
>>>>>>> code significantly.
>>>>>>
>>>>>> That’s too bad.
>>>>>>
>>>>>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>>>>>> actually would do in practice.
>>>>>
>>>>> will check.
>>>>>
>>>>
>>>> Checked that if I mark the whole file by FADV_SEQUENTIAL, cache is not
>>>> removed.
>>>>
>>>> Test:
>>>> [root@kvm fadvise]# cat a.c
>>>> #define _GNU_SOURCE
>>>> #include <fcntl.h>
>>>> #include <unistd.h>
>>>> #include <stdio.h>
>>>> #include <getopt.h>
>>>> #include <string.h>
>>>> #include <stdbool.h>
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>>      int fd;
>>>>      int i;
>>>>      char mb[1024 * 1024];
>>>>      int open_flags = O_RDWR | O_CREAT | O_EXCL;
>>>>      int fadv_flags = 0;
>>>>      int fadv_final_flags = 0;
>>>>      int len = 1024 * 1024;
>>>>      bool do_fsync = false;
>>>>
>>>>      for (i = 1; i < argc - 1; i++) {
>>>>          const char *arg = argv[i];
>>>>
>>>>          if (!strcmp(arg, "direct")) {
>>>>              open_flags |= O_DIRECT;
>>>>          } else if (!strcmp(arg, "seq")) {
>>>>              fadv_flags = POSIX_FADV_SEQUENTIAL;
>>>>          } else if (!strcmp(arg, "dontneed")) {
>>>>              fadv_flags = POSIX_FADV_DONTNEED;
>>>>          } else if (!strcmp(arg, "final-dontneed")) {
>>>>              fadv_final_flags = POSIX_FADV_DONTNEED;
>>>>          } else if (!strcmp(arg, "fsync")) {
>>>>              do_fsync = true;
>>>>          } else {
>>>>              fprintf(stderr, "unknown: %s\n", arg);
>>>>              return 1;
>>>>          }
>>>>      }
>>>>
>>>>      fd = open(argv[argc - 1], open_flags);
>>>>
>>>>      if (fd < 0) {
>>>>          fprintf(stderr, "failed to open\n");
>>>>          return 1;
>>>>      }
>>>>
>>>>      if (fadv_flags) {
>>>>          posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
>>>>      }
>>>>      for (i = 0; i < 100; i++) {
>>>>          write(fd, mb, len);
>>>>      }
>>>>      if (fadv_final_flags) {
>>>>          posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
>>>>      }
>>>>      if (do_fsync) {
>>>>          fsync(fd);
>>>>      }
>>>>
>>>>      close(fd);
>>>> }
>>>>
>>>>
>>>>
>>>> [root@kvm fadvise]# gcc a.c
>>>> [root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
>>>>    RES PAGES  SIZE FILE
>>>>   100M 25600  100M x
>>>> [root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
>>>>    RES PAGES  SIZE FILE
>>>>   100M 25600  100M x
>>>> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
>>>>    RES PAGES  SIZE FILE
>>>>    36M  9216  100M x
>>>> [root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
>>>>    RES PAGES  SIZE FILE
>>>>   100M 25600  100M x
>>>> [root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
>>>>    RES PAGES  SIZE FILE
>>>>   100M 25600  100M x
>>>> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
>>>>    RES PAGES  SIZE FILE
>>>>    36M  9216  100M x
>>>> [root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
>>>> RES PAGES SIZE FILE
>>>>   0B     0   0B x
>>>>
>>>>
>>>>
>>>> Backup-generated pagecache is a formal trash, it will be never used.
>>>> And it's bad that it can displace another good pagecache. So the best
>>>> thing for backup is direct mode + proposed cache.
> 
> What a shame.
> 
> Thanks a lot for testing.
> 
>>> I think that the original intention of Max is about POSIX_FADV_DONTNEED
>>> One should call this fadvise for just fully written cluster.
> 
> I had hoped that SEQUENTIAL would just work, magically.
> 
>> This should work, but:
>>
>>   - as we see from test above, POSIX_FADV_DONTNEED don't remove the whole cache (see final-dontneed)
>>   - as I said we'll have to track writes, so the cache will be the same, just instead of postponed-write operation we'll do fadvise.
>>
>> Hmm. Still, in this way, we will not need some difficult things in my proposed cache.
>>
>> So, it may be reasonable to at least split the big patch so that
>>
>>   - first part brings writes / full-cluster tracking and fadvice
>>
>>   - second part adds caching-wrties option, corresponding flush code and additional performance
>>
>> Does it make sense?
> 
> I think the fadvise solution would have been nice if it gave us something magical that we could also use for normal qemu-img convert (or backup) operations, but as that doesn’t seem to be the case, I don’t think it makes too much sense to implement something like that.  (I imagine doing fadvise also creates the need to implement new block to call into file-posix and so on.)

Actually, with any kind of qemu-img convert we shouldn't benefit from the pagecache, as we write in large, well-aligned chunks. So probably all such tasks should prefer O_DIRECT mode, so as not to produce extra pagecache. And in this way my patch helps compressed qemu-img convert with the target opened in O_DIRECT mode as well. So instead of using fadvise for all such tasks, we can use O_DIRECT for all of them, with compression backed by the compressed-write cache so that it works well with O_DIRECT.

> 
> I’d propose I take some time to look at your patch as-is and then I report back.
> 

Thanks!


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-01-29 16:50 [PATCH 0/7] qcow2: compressed write cache Vladimir Sementsov-Ogievskiy
                   ` (8 preceding siblings ...)
  2021-02-09 13:25 ` Max Reitz
@ 2021-02-10 12:35 ` Kevin Wolf
  2021-02-10 14:35   ` Vladimir Sementsov-Ogievskiy
  9 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2021-02-10 12:35 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy; +Cc: qemu-block, armbru, qemu-devel, mreitz, den

Am 29.01.2021 um 17:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Hi all!
> 
> I know, I have several series waiting for a resend, but I had to switch
> to another task spawned from our customer's bug.
> 
> Original problem: we use O_DIRECT for all vm images in our product, it's
> the policy. The only exclusion is backup target qcow2 image for
> compressed backup, because compressed backup is extremely slow with
> O_DIRECT (due to unaligned writes). Customer complains that backup
> produces a lot of pagecache.
> 
> So we can either implement some internal cache or use fadvise somehow.
> Backup has several async workes, which writes simultaneously, so in both
> ways we have to track host cluster filling (before dropping the cache
> corresponding to the cluster).  So, if we have to track anyway, let's
> try to implement the cache.
> 
> Idea is simple: cache small unaligned write and flush the cluster when
> filled.

I haven't had the time to properly look at the patches, but is there
anything in it that is actually specific to compressed writes?

I'm asking because you may remember that a few years ago I talked at KVM
Forum about how a data cache could be used for small unaligned (to
cluster sizes) writes to reduce COW cost (mostly for sequential access
where the other part of the cluster would be filled soon enough).

So if we're introducing some kind of data cache, wouldn't it be nice to
use it even in the more general case instead of just restricting it to
compression?

Kevin



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] qcow2: compressed write cache
  2021-02-10 12:35 ` Kevin Wolf
@ 2021-02-10 14:35   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-10 14:35 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, qemu-devel, eblake, armbru, mreitz, den

10.02.2021 15:35, Kevin Wolf wrote:
> Am 29.01.2021 um 17:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> Hi all!
>>
>> I know, I have several series waiting for a resend, but I had to switch
>> to another task spawned from our customer's bug.
>>
>> Original problem: we use O_DIRECT for all vm images in our product, it's
>> the policy. The only exclusion is backup target qcow2 image for
>> compressed backup, because compressed backup is extremely slow with
>> O_DIRECT (due to unaligned writes). Customer complains that backup
>> produces a lot of pagecache.
>>
>> So we can either implement some internal cache or use fadvise somehow.
>> Backup has several async workes, which writes simultaneously, so in both
>> ways we have to track host cluster filling (before dropping the cache
>> corresponding to the cluster).  So, if we have to track anyway, let's
>> try to implement the cache.
>>
>> Idea is simple: cache small unaligned write and flush the cluster when
>> filled.
> 
> I haven't had the time to properly look at the patches, but is there
> anything in it that is actually specific to compressed writes?
> 
> I'm asking because you may remember that a few years ago I talked at KVM
> Forum about how a data cache could be used for small unaligned (to
> cluster sizes) writes to reduce COW cost (mostly for sequential access
> where the other part of the cluster would be filled soon enough).
> 
> So if we're introducing some kind of data cache, wouldn't it be nice to
> use it even in the more general case instead of just restricting it to
> compression?
> 

The specific things are:

  - setting data_end per cluster at some point (so that we can flush the cluster even when it is not full). In this case we align data_end up, as we know that the remaining part of the cluster is unused. But that could be refactored into an option.
  - waiting for the whole cluster to be filled

So it could be reused for some (more or less) sequential copying process with unaligned chunks. But the copying jobs in qemu always use aligned chunks; the only exception is copying to a compressed target.

Still, I intentionally implemented it in a separate file, and it doesn't use BDRVQcow2State, so it's simple enough to refactor and reuse if needed.

I can rename it to "unaligned_copy_cache" or something like that.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] block/qcow2: introduce cache for compressed writes
  2021-01-29 16:50 ` [PATCH 2/7] block/qcow2: introduce cache for compressed writes Vladimir Sementsov-Ogievskiy
@ 2021-02-10 17:07   ` Max Reitz
  2021-02-11 12:49     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2021-02-10 17:07 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: kwolf, den, qemu-devel, armbru

On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
> Compressed writes and O_DIRECT are not friends: they works too slow,
> because compressed writes does many small unaligned to 512 writes.
> 
> Let's introduce an internal cache, so that compressed writes may work
> well when O_DIRECT is on.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/qcow2.h                        |  29 +
>   block/qcow2-compressed-write-cache.c | 770 +++++++++++++++++++++++++++
>   block/meson.build                    |   1 +
>   3 files changed, 800 insertions(+)
>   create mode 100644 block/qcow2-compressed-write-cache.c
> 
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 0678073b74..fbdedf89fa 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -322,6 +322,8 @@ typedef struct Qcow2BitmapHeaderExt {
>       uint64_t bitmap_directory_offset;
>   } QEMU_PACKED Qcow2BitmapHeaderExt;
>   
> +typedef struct Qcow2CompressedWriteCache Qcow2CompressedWriteCache;
> +
>   #define QCOW2_MAX_THREADS 4
>   
>   typedef struct BDRVQcow2State {
> @@ -1010,4 +1012,31 @@ int coroutine_fn
>   qcow2_co_decrypt(BlockDriverState *bs, uint64_t host_offset,
>                    uint64_t guest_offset, void *buf, size_t len);
>   
> +Qcow2CompressedWriteCache *qcow2_compressed_cache_new(BdrvChild *data_file,
> +                                                      int64_t cluster_size,
> +                                                      int64_t cache_size);
> +void qcow2_compressed_cache_free(Qcow2CompressedWriteCache *s);
> +int coroutine_fn
> +qcow2_compressed_cache_co_read(Qcow2CompressedWriteCache *s, int64_t offset,
> +                               int64_t bytes, void *buf);
> +int coroutine_fn
> +qcow2_compressed_cache_co_write(Qcow2CompressedWriteCache *s, int64_t offset,
> +                                int64_t bytes, void *buf);
> +void coroutine_fn
> +qcow2_compressed_cache_co_set_cluster_end(Qcow2CompressedWriteCache *s,
> +                                          int64_t cluster_data_end);
> +int coroutine_fn
> +qcow2_compressed_cache_co_flush(Qcow2CompressedWriteCache *s);
> +int qcow2_compressed_cache_flush(BlockDriverState *bs,
> +                                 Qcow2CompressedWriteCache *state);
> +int coroutine_fn
> +qcow2_compressed_cache_co_stop_flush(Qcow2CompressedWriteCache *s);
> +int qcow2_compressed_cache_stop_flush(BlockDriverState *bs,
> +                                      Qcow2CompressedWriteCache *s);
> +void qcow2_compressed_cache_set_size(Qcow2CompressedWriteCache *s,
> +                                     int64_t size);
> +void coroutine_fn
> +qcow2_compressed_cache_co_discard(Qcow2CompressedWriteCache *s,
> +                                  int64_t cluster_offset);
> +

It would be nice if these functions had their interface documented 
somewhere.

>   #endif
> diff --git a/block/qcow2-compressed-write-cache.c b/block/qcow2-compressed-write-cache.c
> new file mode 100644
> index 0000000000..7bb92cb550
> --- /dev/null
> +++ b/block/qcow2-compressed-write-cache.c
> @@ -0,0 +1,770 @@
> +/*
> + * Write cache for qcow2 compressed writes
> + *
> + * Copyright (c) 2021 Virtuozzo International GmbH.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "block/block_int.h"
> +#include "block/block-gen.h"
> +#include "qemu/coroutine.h"
> +#include "qapi/qapi-events-block-core.h"
> +#include "qcow2.h"
> +
> +typedef struct CacheExtent {
> +    int64_t offset;
> +    int64_t bytes;
> +    void *buf;
> +    QLIST_ENTRY(CacheExtent) next;
> +} CacheExtent;
> +
> +typedef struct CacheCluster {

It isn’t immediately clear what these two structures mean by just their 
name, because “extent” has no meaning in the context of qcow2.

I understand CacheExtent to basically be a compressed cluster, and 
CacheCluster to be a host cluster.  Perhaps their names should reflect that.

(OTOH, the Cache* prefix seems unnecessary to me, because these are just 
local structs.)

> +    int64_t cluster_offset;
> +    int64_t n_bytes; /* sum of extents lengths */
> +
> +    /*
> +     * data_end: cluster is full if data_end reached and ready to be flushed.
> +     * data_end is absoluted offset, not relative.

*absolute

> +     */
> +    int64_t data_end;
> +
> +    bool in_flight; /* cluster is being flushed now */
> +
> +    /*
> +     * waiters: coroutines to wake after flush.

No, once in_flight is reset to false, which may happen even if the 
cluster hasn’t been flushed.


As a note that doesn’t really belong anywhere, I don’t like the current 
use of this queue perfectly well.

This is how it is used right now:

Some context A sets in_flight and flushes the cluster.

Some context B wants to evict the cluster from the cache.  It sees 
in_flight, so it gets into the queue.  (If in_flight were not set, B 
would remove the cluster from the cache.)

A finishes flushing, resets in_flight, queues B.  Then if flushing 
succeeded, it will evict the cluster from the cache.  If it failed, the 
cluster remains in the cache.

Then B is executed.  If the cluster isn’t in the cache anymore, it’s 
done.  If still is, it will remove it, and then be done.

My problem is that if B were executed right where it’s enqueued (with 
queue_restart_all()), we would crash because B would see the cluster in 
the cache, remove it, and then A would have a dangling reference and 
remove the (freed) cluster again.

So, well, it works, but I personally find it strange to “invoke“ B where 
you don’t actually want it to run.  I think the @waiters queue should 
only be restarted once A has had the chance to remove the cluster from 
the cache, or to be more general, once A no longer has a reference to 
the cluster.

tl;dr: I think the @waiters queue should only be restarted once the 
reference to the cluster is no longer going to be used.
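
To illustrate, a rough sketch of the reordering I have in mind for the
flush loop (the bookkeeping is just copied out of cache_cluster_remove();
whether it stays factored out is a detail):

     if (ret == 0) {
         ret = cache_cluster_flush(s, c);
     }

     c->in_flight = false;

     if (ret == 0) {
         /* Unlink @c before waking anyone, so a woken waiter can no
          * longer find it in the cache */
         if (c->n_bytes) {
             s->nb_active_clusters--;
         }
         QTAILQ_REMOVE(&s->clusters, c, next);
     }

     /* Wake waiters only now that our reference is no longer needed */
     qemu_co_queue_restart_all(&c->waiters);

     if (ret == 0) {
         cache_cluster_free(c);
     }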

>                                                  Must be empty when in_flight is
> +     * false
> +     */
> +    CoQueue waiters;
> +
> +    QTAILQ_ENTRY(CacheCluster) next;
> +    QLIST_HEAD(, CacheExtent) extents; /* sorted by offset */
> +} CacheCluster;
> +
> +typedef QTAILQ_HEAD(, CacheCluster) CacheClusterQueue;

Why the typedef?  It’s used one time.

> +
> +struct Qcow2CompressedWriteCache {
> +    /*
> +     * @data_file and @cluster_size are copied from qcow2 state. Not huge
> +     * duplications, seems better to avoid accessing the whole qcow2 state
> +     * instead.

Why?  What if in the future the data_file child can be changed with reopen?

What’s the argument against accessing the whole qcow2 state?  That it’s 
a layering violation, that we shouldn’t access it while the rest uses 
it?  If that’s a concern, perhaps *data_file should be an indirect 
pointer to the pointer in the BDRVQcow2State.

> +     */
> +    BdrvChild *data_file;
> +    int64_t cluster_size;
> +
> +    CoQueue waiters; /* coroutines, waiting for free place in the cache */
> +
> +    /*
> +     * Cache limits not total number of clusters but total number of active
> +     * clusters. Active clusters are clusters with non-zero @n_bytes (and
> +     * therefor non-empty @extents). This is done so we don't need to wait for
> +     * cache flush qcow2_compressed_cache_set_cluster_end() (which may create
> +     * cluster with @data_end set but zero @n_bytes), as
> +     * qcow2_compressed_cache_set_cluster_end() is intended to be called from
> +     * qcow2 mutex critical section.
> +     */
> +    int nb_active_clusters;
> +
> +    /*
> +     * If max_active_clusters is 0 it means that cache is inactive: all new
> +     * writes should fallthrough to data_file immediately.
> +     */
> +    int max_active_clusters;
> +
> +    CacheClusterQueue clusters;
> +};
> +
> +/* New extent takes ownership of @buf */
> +static CacheExtent *cache_extent_new(int64_t offset, int64_t bytes, void *buf)
> +{
> +    CacheExtent *e = g_new(CacheExtent, 1);
> +
> +    *e = (CacheExtent) {
> +        .offset = offset,
> +        .bytes = bytes,
> +        .buf = buf,
> +    };
> +
> +    return e;
> +}
> +
> +static void cache_extent_free(CacheExtent *e)
> +{
> +    if (e) {
> +        g_free(e->buf);
> +        g_free(e);
> +    }
> +}
> +
> +static CacheCluster *find_cluster(Qcow2CompressedWriteCache *s,
> +                                  int64_t cluster_offset)
> +{
> +    CacheCluster *c;
> +
> +    assert(QEMU_IS_ALIGNED(cluster_offset, s->cluster_size));
> +
> +    QTAILQ_FOREACH(c, &s->clusters, next) {
> +        if (cluster_offset == c->cluster_offset) {
> +            return c;
> +        }
> +    }

This reads like in any language but C we’d opt for a hash table.  On 
that thought, why don’t we opt for a hash table even in C?
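
Something like this is what I have in mind: keep the QTAILQ for iteration
and add a hash table purely for lookup (just a sketch; cache_cluster_remove()
would need a matching g_hash_table_remove(), and the table would have to be
freed in qcow2_compressed_cache_free()):

     /* in Qcow2CompressedWriteCache: */
     GHashTable *cluster_map; /* &c->cluster_offset -> CacheCluster */

     /* in qcow2_compressed_cache_new(): */
     s->cluster_map = g_hash_table_new(g_int64_hash, g_int64_equal);

     /* in cache_cluster_new(), next to QTAILQ_INSERT_TAIL(): */
     g_hash_table_insert(s->cluster_map, &c->cluster_offset, c);

     /* so that find_cluster() becomes: */
     static CacheCluster *find_cluster(Qcow2CompressedWriteCache *s,
                                       int64_t cluster_offset)
     {
         assert(QEMU_IS_ALIGNED(cluster_offset, s->cluster_size));
         return g_hash_table_lookup(s->cluster_map, &cluster_offset);
     }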

> +
> +    return NULL;
> +}
> +
> +/* Creates "inactive" cluster, which doesn't influence s->nb_active_clusters */
> +static CacheCluster *cache_cluster_new(Qcow2CompressedWriteCache *s,
> +                                       int64_t cluster_offset)
> +{
> +    CacheCluster *c;
> +
> +    assert(!find_cluster(s, cluster_offset));
> +
> +    c = g_new(CacheCluster, 1);
> +    *c = (CacheCluster) {
> +        .cluster_offset = cluster_offset,
> +        .data_end = cluster_offset + s->cluster_size
> +    };
> +
> +    qemu_co_queue_init(&c->waiters);
> +    QTAILQ_INSERT_TAIL(&s->clusters, c, next);
> +
> +    return c;
> +}
> +
> +static void cache_cluster_free(CacheCluster *cluster)
> +{
> +    assert(!cluster->in_flight);
> +    assert(qemu_co_queue_empty(&cluster->waiters));
> +    QLIST_FOREACH_FUNC_SAFE(&cluster->extents, next, cache_extent_free);
> +    g_free(cluster);
> +}
> +
> +static bool cache_cluster_is_full(CacheCluster *cluster)
> +{
> +    assert(cluster->cluster_offset + cluster->n_bytes <= cluster->data_end);
> +    return cluster->cluster_offset + cluster->n_bytes == cluster->data_end;
> +}
> +
> +static void cache_cluster_remove(Qcow2CompressedWriteCache *s,
> +                                 CacheCluster *cluster)
> +{
> +    if (cluster->n_bytes) {
> +        s->nb_active_clusters--;
> +    }
> +    QTAILQ_REMOVE(&s->clusters, cluster, next);
> +    cache_cluster_free(cluster);
> +}
> +
> +/*
> + * Return number of consequtive clusters starting from @first. Store next after

%s/consequi\?tive/consecutive/

> + * last extent pointer into @next, store end offset of last extent into
> + * @end_off.
> + */
> +static int count_consequitive_extents(CacheExtent *first,
> +                                      CacheExtent **next,
> +                                      int64_t *end_off)
> +{
> +    CacheExtent *e;
> +    int64_t off = first->offset;
> +    int nb = 0;
> +
> +    for (e = first; e; e = QLIST_NEXT(e, next)) {
> +        assert(e->offset >= off);
> +        if (e->offset > off) {
> +            break;
> +        }
> +        off += e->bytes;
> +        nb++;
> +    }
> +
> +    if (next) {
> +        *next = e;
> +    }
> +
> +    if (end_off) {
> +        *end_off = off;
> +    }
> +
> +    return nb;
> +}
> +
> +/*
> + * Write consequtive extents, starting from @firest. Store next after last

*first

> + * extent pointer into @next. If align > 1, align end of the whole write by
> + * zero.
> + */
> +static int coroutine_fn flush_consequitive_extents(Qcow2CompressedWriteCache *s,
> +                                                   CacheExtent *first,
> +                                                   CacheExtent **next,
> +                                                   int64_t align)
> +{
> +    CacheExtent *e, *end_extent;
> +    int64_t end;
> +    int nb_extents = count_consequitive_extents(first, &end_extent, &end);

I’d prefer nb_bufs, because it will include a tail if present.

> +    int64_t aligned_end = QEMU_ALIGN_UP(end, align);
> +    int64_t tail = aligned_end - end;
> +    int64_t len = aligned_end - first->offset;
> +
> +    /*
> +     * Alignment if for flushing full cluster, first extent offset is always

s/.*,/The alignment will not exceed the cluster size, so the/?

> +     * aligned.
> +     */
> +    assert(QEMU_IS_ALIGNED(first->offset, align));
> +
> +    if (next) {
> +        *next = end_extent;
> +    }
> +
> +    if (tail) {
> +        nb_extents++;
> +    }
> +
> +    if (nb_extents > IOV_MAX) {
> +        g_autofree void *buf = g_malloc(len);

I wonder if this should be blk_blockalign() if first->offset is aligned 
to the cluster size.  (Don’t know how common that case is.)
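
That is, roughly (paired with qemu_vfree() instead of g_autofree, of course):

     void *buf = qemu_blockalign(s->data_file->bs, len);
     ...
     qemu_vfree(buf);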

> +        char *p = buf;
> +
> +        for (e = first; e != end_extent; e = QLIST_NEXT(e, next)) {
> +            memcpy(p, e->buf, e->bytes);
> +            p += e->bytes;
> +        }
> +
> +        if (tail) {
> +            memset(p, 0, tail);
> +        }
> +
> +        BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
> +        return bdrv_co_pwrite(s->data_file, first->offset, len, buf, 0);
> +    } else {
> +        int ret;
> +        QEMUIOVector qiov;
> +        g_autofree void *tail_buf = NULL;
> +
> +        qemu_iovec_init(&qiov, nb_extents);
> +        for (e = first; e != end_extent; e = QLIST_NEXT(e, next)) {
> +            qemu_iovec_add(&qiov, e->buf, e->bytes);
> +        }
> +
> +        if (tail) {
> +            tail_buf = g_malloc0(tail);
> +            qemu_iovec_add(&qiov, tail_buf, tail);
> +        }
> +
> +        BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
> +        ret = bdrv_co_pwritev(s->data_file, first->offset, len, &qiov, 0);
> +        qemu_iovec_destroy(&qiov);
> +        return ret;
> +    }

The write functions are missing overlap checks.  It can be argued that 
this is done by qcow2_co_pwritev_compressed_task() already, but the idea 
was to pair all actual writes with overlap checks (i.e., cached writes 
in qcow2_co_pwritev_compressed_task() wouldn’t need to do a check).

> +}
> +
> +static int coroutine_fn cache_cluster_flush_full(Qcow2CompressedWriteCache *s,
> +                                                 CacheCluster *cluster)
> +{
> +    int ret;
> +    CacheExtent *end_extent;
> +    int64_t align = MIN(s->cluster_size,
> +                        MAX(s->data_file->bs->bl.request_alignment, 4 * 1024));
> +
> +    assert(cache_cluster_is_full(cluster));
> +
> +    ret = flush_consequitive_extents(s, QLIST_FIRST(&cluster->extents),
> +                                     &end_extent, align);
> +
> +    assert(end_extent == NULL); /* all extents flushed */
> +
> +    return ret;
> +}
> +
> +static int coroutine_fn cache_cluster_flush(Qcow2CompressedWriteCache *s,
> +                                            CacheCluster *c)
> +{
> +    int ret;
> +    CacheExtent *e = QLIST_FIRST(&c->extents);
> +
> +    if (cache_cluster_is_full(c)) {
> +        return cache_cluster_flush_full(s, c);
> +    }
> +
> +    while (e) {
> +        ret = flush_consequitive_extents(s, e, &e, 1);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +int coroutine_fn qcow2_compressed_cache_co_flush(Qcow2CompressedWriteCache *s)
> +{
> +    int ret = 0;
> +    CacheCluster *c;
> +    GList *local_clusters = NULL, *p;
> +
> +    /*
> +     * Make a snapshot of current state: we will not flush clusters created in
> +     * parallel with flush operations and don't allow adding more extents to
> +     * staged clusters. We are also protected from parallel flush operations
> +     * flushing the same clusters.
> +     */
> +    QTAILQ_FOREACH(c, &s->clusters, next) {
> +        if (!c->in_flight && c->n_bytes) {
> +            c->in_flight = true;
> +            local_clusters = g_list_append(local_clusters, c);
> +        }
> +    }
> +
> +    for (p = local_clusters; p; p = p->next) {
> +        CacheCluster *c = p->data;
> +
> +        if (ret == 0) {
> +            ret = cache_cluster_flush(s, c);
> +        }
> +
> +        c->in_flight = false;
> +        qemu_co_queue_restart_all(&c->waiters);
> +
> +        if (ret == 0) {
> +            cache_cluster_remove(s, c);
> +        }
> +    }
> +
> +    g_list_free(local_clusters);
> +
> +    return ret;
> +}
> +
> +int coroutine_fn
> +qcow2_compressed_cache_co_stop_flush(Qcow2CompressedWriteCache *s)
> +{
> +    int ret, save;
> +
> +    save = s->max_active_clusters;
> +    s->max_active_clusters = 0; /* No more extents */

Perhaps better “Don’t cache any more extents” / “Don’t cache any more 
compressed clusters”?

> +
> +    ret = qcow2_compressed_cache_co_flush(s);
> +    if (ret < 0) {
> +        s->max_active_clusters = save;
> +        return ret;
> +    }
> +
> +    assert(QTAILQ_EMPTY(&s->clusters));
> +    return 0;
> +}
> +
> +/* @cluster takes ownership of @extent */
> +static void cluster_add_extent(Qcow2CompressedWriteCache *s,
> +                               CacheCluster *cluster, CacheExtent *extent)
> +{
> +    CacheExtent *e;
> +
> +    assert(extent->bytes);
> +    assert(extent->offset >= cluster->cluster_offset);
> +    assert(extent->offset + extent->bytes <= cluster->data_end);
> +    assert(!cluster->in_flight);
> +
> +    e = QLIST_FIRST(&cluster->extents);
> +    if (!e) {
> +        /* inactive cluster */
> +        assert(!cluster->n_bytes);
> +        s->nb_active_clusters++;
> +        assert(s->nb_active_clusters <= s->max_active_clusters);
> +        QLIST_INSERT_HEAD(&cluster->extents, extent, next);
> +    } else if (e->offset > extent->offset) {
> +        assert(extent->offset + extent->bytes <= e->offset);
> +        QLIST_INSERT_HEAD(&cluster->extents, extent, next);
> +    } else {
> +        while (QLIST_NEXT(e, next) &&
> +               QLIST_NEXT(e, next)->offset < extent->offset) {
> +            e = QLIST_NEXT(e, next);
> +        }
> +
> +        /* Now e is last element with offset < extent->offset */
> +        assert(e->offset + e->bytes <= extent->offset);
> +
> +        QLIST_INSERT_AFTER(e, extent, next);
> +
> +        e = QLIST_NEXT(extent, next);
> +        if (e) {
> +            assert(extent->offset + extent->bytes <= e->offset);
> +        }
> +    }

I like how in C one always tends to implement everything from scratch. 
I think adding an element into an already-sorted list is a standard 
problem, so it’s a shame that it’s implemented from scratch here.  Also 
that it doesn’t use binary search, which would require a random-access 
list, so...

Well.  Not sure if it’s necessary.  Probably not.  Although I find it 
weird to start the search from the front when I’d imagine it’s more 
likely that new compressed clusters (extents) are added to the back.

> +
> +    cluster->n_bytes += extent->bytes;
> +}
> +
> +static CacheCluster *find_cluster_to_flush(Qcow2CompressedWriteCache *s)
> +{
> +    CacheCluster *c;
> +
> +    QTAILQ_FOREACH(c, &s->clusters, next) {
> +        if (!c->in_flight && cache_cluster_is_full(c)) {
> +            return c;

I don’t like it very much that this cache is built as an R/W cache, i.e. 
that entries are retained until they need to be flushed because a new 
entry needs space.

It was my impression this was meant as a write cache, that should speed 
up specifically O_DIRECT operation.  To me, that implies that entries 
are flushed once they are done, which is precisely why it works for 
compressed clusters, because we know when those are done.

I suppose nobody cares in practice because compression is basically only 
used for writing out whole images (so displacing clusters happens all 
the time, and once you’re done, the image is flushed and closed), but it 
just looks strange to me.

> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +/* Cache an extent if there is a place */
> +static bool coroutine_fn
> +try_cache_extent(Qcow2CompressedWriteCache *s, CacheExtent *extent,
> +                 bool *cluster_in_flight)

Like with other functions, but here especially, I would have liked some 
documentation on the interface.  I suppose the return value reflects 
whether the “try” worked (so a good guess is possible to me, but it’s 
still a guess at this point).

I have no idea what *cluster_in_flight means.  I would have guessed it 
means whether the host cluster touched by the compressed cluster is in 
flight, but it’s also set when the cache is disabled.  The caller seems 
to interpret it as “write this immediately”, which to me actually seems 
wrong, but more on that in said caller function below.

Perhaps this function should just have an enum return value that tells 
the caller precisely what to do with some expressively named values.
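
E.g. something along these lines (names purely for illustration):

     typedef enum {
         TRY_CACHE_OK,        /* the cache took ownership of the extent */
         TRY_CACHE_BYPASS,    /* cache disabled: write the extent directly */
         TRY_CACHE_IN_FLIGHT, /* the target host cluster is being flushed */
         TRY_CACHE_FULL,      /* no free entry: flush or wait, then retry */
     } TryCacheResult;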

> +{
> +    CacheCluster *c;
> +    int64_t cluster_offset = QEMU_ALIGN_DOWN(extent->offset, s->cluster_size);
> +
> +    assert(extent->bytes);
> +
> +    if (s->max_active_clusters == 0) {
> +        *cluster_in_flight = true;
> +        return false;
> +    }
> +
> +    *cluster_in_flight = false;
> +
> +    c = find_cluster(s, cluster_offset);
> +    if (c && c->in_flight) {
> +        *cluster_in_flight = true;
> +        return false;
> +    }
> +    if (s->nb_active_clusters >= s->max_active_clusters &&
> +        (!c || !c->n_bytes))
> +    {
> +        /*
> +         * Cache is full, we can't allocate a new cluster and can't activate
> +         * existing inactive cluster
> +         */
> +        return false;
> +    }
> +
> +    if (!c) {
> +        c = cache_cluster_new(s, cluster_offset);
> +    }
> +
> +    cluster_add_extent(s, c, extent);
> +
> +    if (cache_cluster_is_full(c)) {
> +        qemu_co_queue_restart_all(&s->waiters);

This interface is unclear.  The documentation of s->waiters says to wake 
them up once there is a free space in the cache, but that isn’t the case 
here.  It’s true that it’s easy to make a free space (by flushing the 
full cluster), but there is no free space.

This also ties in to my uneasiness about how full clusters aren’t 
flushed immediately.

> +    }
> +
> +    return true;
> +}
> +
> +/* Takes ownership of @buf, don't free it after call! */
> +int coroutine_fn
> +qcow2_compressed_cache_co_write(Qcow2CompressedWriteCache *s, int64_t offset,
> +                                int64_t bytes, void *buf)
> +{
> +    int ret;
> +    int i;
> +    CacheExtent *extents[] = {NULL, NULL};
> +    int nb_extents = 0; /* number of non-NULL elements in @extents */
> +    int64_t e0_len;
> +
> +    assert(bytes > 0);
> +    assert(bytes < s->cluster_size);
> +
> +    e0_len = MIN(bytes, QEMU_ALIGN_UP(offset + 1, s->cluster_size) - offset);
> +    extents[0] = cache_extent_new(offset, e0_len, buf);
> +    nb_extents = 1;
> +
> +    if (bytes > e0_len) {
> +        int64_t e1_len = bytes - e0_len;
> +        /*
> +         * We have to allocate a new buffer, so that clusters are independent
> +         * and can free their extents when needed.
> +         */
> +        void *e1_buf = g_memdup(((const char *)buf) + e0_len, e1_len);
> +
> +        extents[1] = cache_extent_new(offset + e0_len, e1_len, e1_buf);
> +        nb_extents = 2;
> +    }
> +
> +    while (nb_extents) {
> +        bool do_write = false;
> +
> +        for (i = 0; i < 2; i++) {
> +            CacheExtent *e = extents[i];
> +
> +            do_write = false;
> +
> +            if (!e) {
> +                continue;
> +            }
> +
> +            if (try_cache_extent(s, e, &do_write)) {
> +                extents[i] = NULL;
> +                nb_extents--;
> +                continue;
> +            }
> +
> +            if (do_write) {
> +                ret = bdrv_co_pwrite(s->data_file, e->offset, e->bytes,
> +                                     e->buf, 0);

Is this safe?  do_write is set if the host cluster touched by this 
extent is in flight.  Because it can’t be full, it must currently be in 
the process of being flushed by cache_cluster_flush(), which flushes all 
consecutive areas with flush_consequitive_extents().  If the underlying 
file has some request alignment (which it tends to do with O_DIRECT), 
then those writes are likely transformed into RMW operations spanning 
more than their precise byte range.  Isn’t it possible that one of those 
RMWs then intersects with this write here?

If the host cluster is in flight, shouldn’t we just wait until it’s flushed?
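
I.e. something like this (in try_cache_extent() or in its caller; I am not
sure how well it meshes with the caller’s retry loop):

     c = find_cluster(s, cluster_offset);
     while (c && c->in_flight) {
         /* Wait for the flush (and any RMW on the data file) to finish,
          * then look the cluster up again */
         qemu_co_queue_wait(&c->waiters, NULL);
         c = find_cluster(s, cluster_offset);
     }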

> +
> +                cache_extent_free(e);
> +                extents[i] = NULL;
> +                nb_extents--;
> +
> +                if (ret < 0) {
> +                    goto out;
> +                }
> +            }
> +        }
> +
> +        if (do_write) {
> +            /*
> +             * We yielded during second extent writing. Probably the cache is
> +             * already free and we can now cache the first extent.

Where do we yield?  Do you mean it probably takes time to do the write, 
so it’s possible that if do_write is true because some cluster was in 
the process of being flushed, it is now removed from the cache, so we 
have a free entry?

Why not just check for whether we have a free entry?

Also, it’s possible that do_write is true just because the cache is 
disabled, which I guess is handled fine because then both extents will 
have been written to disk, so that nb_extents is now 0...  But that 
behavior isn’t documented here.

> +             */
> +            continue;
> +        }
> +
> +        if (nb_extents) {
> +            CacheCluster *cluster = find_cluster_to_flush(s);
> +
> +            if (cluster) {
> +                cluster->in_flight = true;
> +                ret = cache_cluster_flush_full(s, cluster);
> +                cluster->in_flight = false;
> +                qemu_co_queue_restart_all(&cluster->waiters);
> +                qemu_co_queue_restart_all(&s->waiters);

Looks weird to wake up other waiters here, because we are very likely 
going to take that spot, so it’s likely that nobody will be able to make 
use of it after we took another iteration.

It’s also weird because it’s different than the two other places which 
wake up s->waiters not if there’s a free space, but if a free space can 
be made by flushing find_cluster_to_flush().  As in, this is the only 
place where the queue is used as it is documented.

I think this again ties in with the question whether we shouldn’t flush 
clusters as soon as they are full instead of waiting until we need 
another free entry.

> +                if (ret < 0) {
> +                    goto out;
> +                }
> +                cache_cluster_remove(s, cluster);
> +                continue;
> +            }
> +
> +            qemu_co_queue_wait(&s->waiters, NULL);
> +        }
> +    }
> +
> +    ret = 0;
> +
> +out:
> +    for (i = 0; i < 2; i++) {
> +        cache_extent_free(extents[i]);
> +    }
> +
> +    return 0;

return ret, I think.

> +}
> +
> +int coroutine_fn
> +qcow2_compressed_cache_co_read(Qcow2CompressedWriteCache *s, int64_t offset,
> +                               int64_t bytes, void *buf)
> +{
> +    CacheCluster *c;
> +    CacheExtent *e;
> +    int64_t cluster_offset = QEMU_ALIGN_DOWN(offset, s->cluster_size);
> +
> +    c = find_cluster(s, cluster_offset);
> +    if (c) {
> +        QLIST_FOREACH(e, &c->extents, next) {
> +            if (e->offset == offset && e->bytes <= bytes) {
> +                memcpy(buf, e->buf, e->bytes);
> +                return 0;
> +            }

Again, reads like it should be a binary search.

I suppose e->bytes < bytes is OK because the only caller of this 
function is qcow2_co_preadv_compressed(), which passes a maximum length, 
effectively.  But then I’d call the parameter accordingly (i.e., 
max_bytes) and zero out the tail.
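
I.e. something like:

     if (e->offset == offset && e->bytes <= max_bytes) {
         memcpy(buf, e->buf, e->bytes);
         if (e->bytes < max_bytes) {
             /* the caller asked for more than this compressed cluster has */
             memset((char *)buf + e->bytes, 0, max_bytes - e->bytes);
         }
         return 0;
     }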

> +        }
> +    }
> +
> +    return bdrv_co_pread(s->data_file, offset, bytes, buf, 0);
> +}
> +
> +/*
> + * Caller states, that there would be no writes to this cluster beyond
> + * specified @cluster_data_end. So, it's OK to flush cluster when it is filled
> + * up to @cluster_data_end and it's OK to align flushing write operation up to
> + * some alignment (not greater than cluster_size of course).
> + */
> +void coroutine_fn
> +qcow2_compressed_cache_co_set_cluster_end(Qcow2CompressedWriteCache *s,
> +                                          int64_t cluster_data_end)
> +{
> +    int64_t cluster_offset = QEMU_ALIGN_DOWN(cluster_data_end, s->cluster_size);

If cluster_data_end is aligned to a cluster boundary (I don’t think the 
caller does that, but...) this will do nothing, so cluster_offset is 
likely to be one cluster too high.

I wonder whether we should assert that cluster_data_end is not aligned 
to the cluster size, or whether we should use cluster_data_end - 1 here, 
or whether the caller should pass cluster_offset.
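
I.e. one of:

     assert(!QEMU_IS_ALIGNED(cluster_data_end, s->cluster_size));

or:

     int64_t cluster_offset = QEMU_ALIGN_DOWN(cluster_data_end - 1,
                                              s->cluster_size);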

> +    CacheExtent *e;
> +    CacheCluster *c;
> +
> +    c = find_cluster(s, cluster_offset);
> +    if (!c) {
> +        c = cache_cluster_new(s, cluster_offset);

Is this case reachable?  How?

> +    }
> +
> +    QLIST_FOREACH(e, &c->extents, next) {
> +        assert(e->offset + e->bytes <= cluster_data_end);
> +    }
> +
> +    /* Shouldn't set data_end several times */
> +    assert(c->data_end == c->cluster_offset + s->cluster_size);
> +
> +    c->data_end = cluster_data_end;
> +    if (cache_cluster_is_full(c)) {
> +        qemu_co_queue_restart_all(&s->waiters);

Like in try_cache_extent(), this wakes up the queue while technically 
there is no free space.  I get the intention here, because as documented 
somewhere, this function is called from a mutexed section and we 
probably don’t want to write data to the data_file here.

But it would make sense for me to flush the cluster later as soon as 
possible, and then wake up the waiters.

> +    }
> +}
> +
> +Qcow2CompressedWriteCache *qcow2_compressed_cache_new(BdrvChild *data_file,
> +                                                      int64_t cluster_size,
> +                                                      int64_t cache_size)
> +{
> +    Qcow2CompressedWriteCache *s = g_new(Qcow2CompressedWriteCache, 1);
> +
> +    assert(cache_size >= cluster_size);
> +
> +    *s = (Qcow2CompressedWriteCache) {
> +        .data_file = data_file,
> +        .cluster_size = cluster_size,
> +        .max_active_clusters = cache_size / cluster_size,
> +    };
> +
> +    qemu_co_queue_init(&s->waiters);
> +    QTAILQ_INIT(&s->clusters);
> +
> +    return s;
> +}
> +
> +void qcow2_compressed_cache_free(Qcow2CompressedWriteCache *s)
> +{
> +    if (!s) {
> +        return;
> +    }
> +
> +    QTAILQ_FOREACH_FUNC_SAFE(&s->clusters, next, cache_cluster_free);

It seems strange to me to call this function if there are still clusters 
in the cache.  I suppose you can’t do anything else if the flush 
function failed to evict them, so I suppose it’s correct to do this, but 
it makes me flinch a bit.

> +    g_free(s);
> +}
> +
> +void qcow2_compressed_cache_set_size(Qcow2CompressedWriteCache *s,
> +                                     int64_t size)
> +{
> +    /*
> +     * We don't do flush here. Don't care too much, it's safe to have cache
> +     * larger than maximum, it will only decrease until it reaches the new
> +     * maximum.
> +     */
> +    assert(size >= s->cluster_size);
> +    s->max_active_clusters = size / s->cluster_size;
> +}

Simple enough, but I don’t see why we need it.  More on that in the next 
patch, though.

> +
> +void coroutine_fn
> +qcow2_compressed_cache_co_discard(Qcow2CompressedWriteCache *s,
> +                                  int64_t cluster_offset)

I don’t like this function’s name because I associate “discard” with 
discarding data.  I think “evict” would fit better, or perhaps “drop”. 
(I think “drop” better conveys that the data isn’t supposed to be 
written back.)

Max

> +{
> +    CacheCluster *c;
> +
> +    while (true) {
> +        c = find_cluster(s, cluster_offset);
> +        if (!c) {
> +            return;
> +        }
> +        if (!c->in_flight) {
> +            cache_cluster_remove(s, c);
> +            return;
> +        }
> +        qemu_co_queue_wait(&c->waiters, NULL);
> +    }
> +}
> +
> +
> +/*
> + * Wrappers for qcow2_compressed_cache_co_flush
> + *
> + * TODO: update scripts/block-coroutine-wrapper to generate this too
> + */
> +
> +typedef struct Qcow2CompressedCacheCoFlush {
> +    BdrvPollCo poll_state;
> +    Qcow2CompressedWriteCache *state;
> +} Qcow2CompressedCacheCoFlush;
> +
> +static void coroutine_fn
> +qcow2_compressed_cache_co_flush_entry(void *opaque)
> +{
> +    Qcow2CompressedCacheCoFlush *s = opaque;
> +
> +    s->poll_state.ret = qcow2_compressed_cache_co_flush(s->state);
> +    s->poll_state.in_progress = false;
> +
> +    aio_wait_kick();
> +}
> +
> +int qcow2_compressed_cache_flush(BlockDriverState *bs,
> +                                 Qcow2CompressedWriteCache *state)
> +{
> +    if (qemu_in_coroutine()) {
> +        return qcow2_compressed_cache_co_flush(state);
> +    } else {
> +        Qcow2CompressedCacheCoFlush s = {
> +            .poll_state.bs = bs,
> +            .poll_state.in_progress = true,
> +
> +            .state = state,
> +        };
> +
> +        s.poll_state.co =
> +            qemu_coroutine_create(qcow2_compressed_cache_co_flush_entry, &s);
> +
> +        return bdrv_poll_co(&s.poll_state);
> +    }
> +}
> +
> +/*
> + * Wrappers for qcow2_compressed_cache_co_stop_flush
> + *
> + * TODO: update scripts/block-coroutine-wrapper to generate this too
> + */
> +
> +typedef struct Qcow2CompressedCacheCoStopFlush {
> +    BdrvPollCo poll_state;
> +    Qcow2CompressedWriteCache *state;
> +} Qcow2CompressedCacheCoStopFlush;
> +
> +static void coroutine_fn
> +qcow2_compressed_cache_co_stop_flush_entry(void *opaque)
> +{
> +    Qcow2CompressedCacheCoStopFlush *s = opaque;
> +
> +    s->poll_state.ret = qcow2_compressed_cache_co_stop_flush(s->state);
> +    s->poll_state.in_progress = false;
> +
> +    aio_wait_kick();
> +}
> +
> +int qcow2_compressed_cache_stop_flush(BlockDriverState *bs,
> +                                      Qcow2CompressedWriteCache *state)
> +{
> +    if (qemu_in_coroutine()) {
> +        return qcow2_compressed_cache_co_stop_flush(state);
> +    } else {
> +        Qcow2CompressedCacheCoStopFlush s = {
> +            .poll_state.bs = bs,
> +            .poll_state.in_progress = true,
> +
> +            .state = state,
> +        };
> +
> +        s.poll_state.co =
> +            qemu_coroutine_create(qcow2_compressed_cache_co_stop_flush_entry,
> +                                  &s);
> +
> +        return bdrv_poll_co(&s.poll_state);
> +    }
> +}
> diff --git a/block/meson.build b/block/meson.build
> index eeaefe5809..7b94794c28 100644
> --- a/block/meson.build
> +++ b/block/meson.build
> @@ -25,6 +25,7 @@ block_ss.add(files(
>     'qcow2-bitmap.c',
>     'qcow2-cache.c',
>     'qcow2-cluster.c',
> +  'qcow2-compressed-write-cache.c',
>     'qcow2-refcount.c',
>     'qcow2-snapshot.c',
>     'qcow2-threads.c',
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/7] qemu/queue: add some useful QLIST_ and QTAILQ_ macros
  2021-01-29 16:50 ` [PATCH 1/7] qemu/queue: add some useful QLIST_ and QTAILQ_ macros Vladimir Sementsov-Ogievskiy
  2021-02-01  8:29   ` Markus Armbruster
@ 2021-02-10 17:07   ` Max Reitz
  1 sibling, 0 replies; 32+ messages in thread
From: Max Reitz @ 2021-02-10 17:07 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: kwolf, den, qemu-devel, armbru

On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
> Add QLIST_FOREACH_FUNC_SAFE(), QTAILQ_FOREACH_FUNC_SAFE() and
> QTAILQ_POP_HEAD(), to be used in following commit.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   include/qemu/queue.h | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
> 
> diff --git a/include/qemu/queue.h b/include/qemu/queue.h
> index e029e7bf66..03e1fce85f 100644
> --- a/include/qemu/queue.h
> +++ b/include/qemu/queue.h
> @@ -173,6 +173,13 @@ struct {                                                                \
>                   (var) && ((next_var) = ((var)->field.le_next), 1);      \
>                   (var) = (next_var))
>   
> +#define QLIST_FOREACH_FUNC_SAFE(head, field, func) do {                 \
> +    typeof(*QLIST_FIRST(head)) *qffs_var, *qffs_next_var;               \
> +    QLIST_FOREACH_SAFE(qffs_var, (head), field, qffs_next_var) {        \
> +        (func)(qffs_var);                                               \
> +    }                                                                   \
> +} while (/*CONSTCOND*/0)
> +

On one hand I have inexplicable reservations against adding these macros 
if they’re only used one time in the next patch.

On the other, I have one clearly expressible reservation, and that’s the 
fact that perhaps some future functions that could make use of this want 
to take more arguments, like closures.

Could we make these function vararg macros?  I.e., e.g.,

#define QLIST_FOREACH_FUNC_SAFE(head, field, func, ...) do {
     ...
         (func)(qffs_var, ## __VA_ARGS__);
     ...
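
A (purely hypothetical) caller that needs a closure could then do e.g.:

     QLIST_FOREACH_FUNC_SAFE(&cluster->extents, next,
                             cache_extent_free_accounted, s);

(cache_extent_free_accounted() doesn’t exist, it’s just to show the shape.)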

Max

>   /*
>    * List access methods.
>    */
> @@ -490,6 +497,13 @@ union {                                                                 \
>                (var) && ((prev_var) = QTAILQ_PREV(var, field), 1);        \
>                (var) = (prev_var))
>   
> +#define QTAILQ_FOREACH_FUNC_SAFE(head, field, func) do {                \
> +    typeof(*QTAILQ_FIRST(head)) *qffs_var, *qffs_next_var;              \
> +    QTAILQ_FOREACH_SAFE(qffs_var, (head), field, qffs_next_var) {       \
> +        (func)(qffs_var);                                               \
> +    }                                                                   \
> +} while (/*CONSTCOND*/0)
> +
>   /*
>    * Tail queue access methods.
>    */
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] block/qcow2: use compressed write cache
  2021-01-29 16:50 ` [PATCH 3/7] block/qcow2: use compressed write cache Vladimir Sementsov-Ogievskiy
@ 2021-02-10 17:11   ` Max Reitz
  2021-02-11 12:53     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2021-02-10 17:11 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: kwolf, den, qemu-devel, armbru

On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
> Introduce a new option: compressed-cache-size, with default to 64
> clusters (to be not less than 64 default max-workers for backup job).
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   qapi/block-core.json   |  8 +++-
>   block/qcow2.h          |  4 ++
>   block/qcow2-refcount.c | 13 +++++++
>   block/qcow2.c          | 87 ++++++++++++++++++++++++++++++++++++++++--
>   4 files changed, 108 insertions(+), 4 deletions(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 9f555d5c1d..e0be6657f3 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -3202,6 +3202,11 @@
>   #             an image, the data file name is loaded from the image
>   #             file. (since 4.0)
>   #
> +# @compressed-cache-size: The maximum size of compressed write cache in
> +#                         bytes. If positive must be not less than
> +#                         cluster size. 0 disables the feature. Default
> +#                         is 64 * cluster_size. (since 6.0)

Do we need this, really?  If you don’t use compression, the cache won’t 
use any memory, right?  Do you plan on using this option?

I’d just set it to a sane default.

OTOH, “a sane default” poses two questions, namely whether 64 * 
cluster_size is reasonable – with subclusters, the cluster size may be 
rather high, so 64 * cluster_size may well be like 128 MB.  Are 64 
clusters really necessary for a reasonable performance?

Second, I think I could live with a rather high default if clusters are 
flushed as soon as they are full.  OTOH, as I briefly touched on, in 
practice, I suppose compressed images are just written to constantly, so 
even if clusters are flushed as soon as they are full, the cache will 
still remain full all the time.


Different topic: Why can the cache be disabled at all?  I thought there were
no downsides?

(Not being able to disable it would make the code simpler, hence me asking.)

Max



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] block/qcow2: introduce cache for compressed writes
  2021-02-10 17:07   ` Max Reitz
@ 2021-02-11 12:49     ` Vladimir Sementsov-Ogievskiy
  2021-02-18 15:04       ` Max Reitz
  0 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-11 12:49 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, eblake, armbru, kwolf, den

You may want to jump to my last inline answer first.

10.02.2021 20:07, Max Reitz wrote:
> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>> Compressed writes and O_DIRECT are not friends: they works too slow,
>> because compressed writes does many small unaligned to 512 writes.
>>
>> Let's introduce an internal cache, so that compressed writes may work
>> well when O_DIRECT is on.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/qcow2.h                        |  29 +
>>   block/qcow2-compressed-write-cache.c | 770 +++++++++++++++++++++++++++
>>   block/meson.build                    |   1 +
>>   3 files changed, 800 insertions(+)
>>   create mode 100644 block/qcow2-compressed-write-cache.c
>>
>> diff --git a/block/qcow2.h b/block/qcow2.h
>> index 0678073b74..fbdedf89fa 100644
>> --- a/block/qcow2.h
>> +++ b/block/qcow2.h
>> @@ -322,6 +322,8 @@ typedef struct Qcow2BitmapHeaderExt {
>>       uint64_t bitmap_directory_offset;
>>   } QEMU_PACKED Qcow2BitmapHeaderExt;
>> +typedef struct Qcow2CompressedWriteCache Qcow2CompressedWriteCache;
>> +
>>   #define QCOW2_MAX_THREADS 4
>>   typedef struct BDRVQcow2State {
>> @@ -1010,4 +1012,31 @@ int coroutine_fn
>>   qcow2_co_decrypt(BlockDriverState *bs, uint64_t host_offset,
>>                    uint64_t guest_offset, void *buf, size_t len);
>> +Qcow2CompressedWriteCache *qcow2_compressed_cache_new(BdrvChild *data_file,
>> +                                                      int64_t cluster_size,
>> +                                                      int64_t cache_size);
>> +void qcow2_compressed_cache_free(Qcow2CompressedWriteCache *s);
>> +int coroutine_fn
>> +qcow2_compressed_cache_co_read(Qcow2CompressedWriteCache *s, int64_t offset,
>> +                               int64_t bytes, void *buf);
>> +int coroutine_fn
>> +qcow2_compressed_cache_co_write(Qcow2CompressedWriteCache *s, int64_t offset,
>> +                                int64_t bytes, void *buf);
>> +void coroutine_fn
>> +qcow2_compressed_cache_co_set_cluster_end(Qcow2CompressedWriteCache *s,
>> +                                          int64_t cluster_data_end);
>> +int coroutine_fn
>> +qcow2_compressed_cache_co_flush(Qcow2CompressedWriteCache *s);
>> +int qcow2_compressed_cache_flush(BlockDriverState *bs,
>> +                                 Qcow2CompressedWriteCache *state);
>> +int coroutine_fn
>> +qcow2_compressed_cache_co_stop_flush(Qcow2CompressedWriteCache *s);
>> +int qcow2_compressed_cache_stop_flush(BlockDriverState *bs,
>> +                                      Qcow2CompressedWriteCache *s);
>> +void qcow2_compressed_cache_set_size(Qcow2CompressedWriteCache *s,
>> +                                     int64_t size);
>> +void coroutine_fn
>> +qcow2_compressed_cache_co_discard(Qcow2CompressedWriteCache *s,
>> +                                  int64_t cluster_offset);
>> +
> 
> It would be nice if these functions had their interface documented somewhere.

I tried to comment the difficult things in the .c file... Is there a preference
for where to document what a function does and how, in .h or in .c?

> 
>>   #endif
>> diff --git a/block/qcow2-compressed-write-cache.c b/block/qcow2-compressed-write-cache.c
>> new file mode 100644
>> index 0000000000..7bb92cb550
>> --- /dev/null
>> +++ b/block/qcow2-compressed-write-cache.c
>> @@ -0,0 +1,770 @@
>> +/*
>> + * Write cache for qcow2 compressed writes
>> + *
>> + * Copyright (c) 2021 Virtuozzo International GmbH.
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>> + * of this software and associated documentation files (the "Software"), to deal
>> + * in the Software without restriction, including without limitation the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>> + * copies of the Software, and to permit persons to whom the Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>> + * THE SOFTWARE.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +
>> +#include "block/block_int.h"
>> +#include "block/block-gen.h"
>> +#include "qemu/coroutine.h"
>> +#include "qapi/qapi-events-block-core.h"
>> +#include "qcow2.h"
>> +
>> +typedef struct CacheExtent {
>> +    int64_t offset;
>> +    int64_t bytes;
>> +    void *buf;
>> +    QLIST_ENTRY(CacheExtent) next;
>> +} CacheExtent;
>> +
>> +typedef struct CacheCluster {
> 
> It isn’t immediately clear what these two structures mean by just their name, because “extent” has no meaning in the context of qcow2.

It's not important to the cache whether the extents are compressed clusters or not, so I'd keep the more generic name.

> 
> I understand CacheExtent to basically be a compressed cluster, and CacheCluster to be a host cluster.  Perhaps their names should reflect that.
> 
> (OTOH, the Cache* prefix seems unnecessary to me, because these are just local structs.)
> 
>> +    int64_t cluster_offset;
>> +    int64_t n_bytes; /* sum of extents lengths */
>> +
>> +    /*
>> +     * data_end: cluster is full if data_end reached and ready to be flushed.
>> +     * data_end is absoluted offset, not relative.
> 
> *absolute
> 
>> +     */
>> +    int64_t data_end;
>> +
>> +    bool in_flight; /* cluster is being flushed now */
>> +
>> +    /*
>> +     * waiters: coroutines to wake after flush.
> 
> No, once in_flight is reset to false, which may happen even if the cluster hasn’t been flushed.
> 
> 
> As a note that doesn’t really belong anywhere, I don’t like the current use of this queue perfectly well.
> 
> This is how it is used right now:
> 
> Some context A sets in_flight and flushes the cluster.
> 
> Some context B wants to evict the cluster from the cache.  It sees in_flight, so it gets into the queue.  (If in_flight were not set, B would remove the cluster from the cache.)
> 
> A finishes flushing, resets in_flight, queues B.  Then if flushing succeeded, it will evict the cluster from the cache.  If it failed, the cluster remains in the cache.
> 
> Then B is executed.  If the cluster isn’t in the cache anymore, it’s done.  If still is, it will remove it, and then be done.

all correct

> 
> My problem is that if B were executed right where it’s enqueued (with queue_restart_all()), we would crash because B would see the cluster in the cache, remove it, and then A would have a dangling reference and remove the (freed) cluster again.

But queue_restart_all() doesn't execute B immediately, only when A yields. And I think a lot of code relies on this.

> 
> So, well, it works, but I personally find it strange to “invoke“ B where you don’t actually want it to run.  I think the @waiters queue should only be restarted once A has had the chance to remove the cluster from the cache, or to be more general, once A no longer has a reference to the cluster.

You are right, it looks strange. Why did I write it this way? I don't remember :\ .. Will try to reorder.

> 
> tl;dr: I think the @waiters queue should only be restarted once the reference to the cluster is no longer going to be used.

yes, that will look better.

> 
>>                                                  Must be empty when in_flight is
>> +     * false
>> +     */
>> +    CoQueue waiters;
>> +
>> +    QTAILQ_ENTRY(CacheCluster) next;
>> +    QLIST_HEAD(, CacheExtent) extents; /* sorted by offset */
>> +} CacheCluster;
>> +
>> +typedef QTAILQ_HEAD(, CacheCluster) CacheClusterQueue;
> 
> Why the typedef?  It’s used one time.

Haha, good question. At some stage of development I had two queues, a separate one for complete clusters, and passed these queues to functions.. But I decided that this made me go through two queues in the same places instead of one and didn't make things simpler or safer. So I dropped it. But the typedef was left over. Will drop it as well.

> 
>> +
>> +struct Qcow2CompressedWriteCache {
>> +    /*
>> +     * @data_file and @cluster_size are copied from qcow2 state. Not huge
>> +     * duplications, seems better to avoid accessing the whole qcow2 state
>> +     * instead.
> 
> Why?  What if in the future the data_file child can be changed with reopen?

I hadn't considered this.. Still, it's impossible now, yes? But it would be possible in the future of course.

> 
> What’s the argument against accessing the whole qcow2 state?  That it’s a layering violation, that we shouldn’t access it while the rest uses it? 

To keep it simple to reuse the cache for something else.

> If that’s a concern, perhaps *data_file should be an indirect pointer to the pointer in the BDRVQcow2State.

Or we should update it on reopen (like cache_size)

> 
>> +     */
>> +    BdrvChild *data_file;
>> +    int64_t cluster_size;
>> +
>> +    CoQueue waiters; /* coroutines, waiting for free place in the cache */
>> +
>> +    /*
>> +     * Cache limits not total number of clusters but total number of active
>> +     * clusters. Active clusters are clusters with non-zero @n_bytes (and
>> +     * therefor non-empty @extents). This is done so we don't need to wait for
>> +     * cache flush qcow2_compressed_cache_set_cluster_end() (which may create
>> +     * cluster with @data_end set but zero @n_bytes), as
>> +     * qcow2_compressed_cache_set_cluster_end() is intended to be called from
>> +     * qcow2 mutex critical section.
>> +     */
>> +    int nb_active_clusters;
>> +
>> +    /*
>> +     * If max_active_clusters is 0 it means that cache is inactive: all new
>> +     * writes should fallthrough to data_file immediately.
>> +     */
>> +    int max_active_clusters;
>> +
>> +    CacheClusterQueue clusters;
>> +};
>> +
>> +/* New extent takes ownership of @buf */
>> +static CacheExtent *cache_extent_new(int64_t offset, int64_t bytes, void *buf)
>> +{
>> +    CacheExtent *e = g_new(CacheExtent, 1);
>> +
>> +    *e = (CacheExtent) {
>> +        .offset = offset,
>> +        .bytes = bytes,
>> +        .buf = buf,
>> +    };
>> +
>> +    return e;
>> +}
>> +
>> +static void cache_extent_free(CacheExtent *e)
>> +{
>> +    if (e) {
>> +        g_free(e->buf);
>> +        g_free(e);
>> +    }
>> +}
>> +
>> +static CacheCluster *find_cluster(Qcow2CompressedWriteCache *s,
>> +                                  int64_t cluster_offset)
>> +{
>> +    CacheCluster *c;
>> +
>> +    assert(QEMU_IS_ALIGNED(cluster_offset, s->cluster_size));
>> +
>> +    QTAILQ_FOREACH(c, &s->clusters, next) {
>> +        if (cluster_offset == c->cluster_offset) {
>> +            return c;
>> +        }
>> +    }
> 
> This reads like in any language but C we’d opt for a hash table.  On that thought, why don’t we opt for a hash table even in C?

Agreed, why not. I just tried to avoid complications that could be avoided in a first version.

> 
>> +
>> +    return NULL;
>> +}
>> +
>> +/* Creates "inactive" cluster, which doesn't influence s->nb_active_clusters */
>> +static CacheCluster *cache_cluster_new(Qcow2CompressedWriteCache *s,
>> +                                       int64_t cluster_offset)
>> +{
>> +    CacheCluster *c;
>> +
>> +    assert(!find_cluster(s, cluster_offset));
>> +
>> +    c = g_new(CacheCluster, 1);
>> +    *c = (CacheCluster) {
>> +        .cluster_offset = cluster_offset,
>> +        .data_end = cluster_offset + s->cluster_size
>> +    };
>> +
>> +    qemu_co_queue_init(&c->waiters);
>> +    QTAILQ_INSERT_TAIL(&s->clusters, c, next);
>> +
>> +    return c;
>> +}
>> +
>> +static void cache_cluster_free(CacheCluster *cluster)
>> +{
>> +    assert(!cluster->in_flight);
>> +    assert(qemu_co_queue_empty(&cluster->waiters));
>> +    QLIST_FOREACH_FUNC_SAFE(&cluster->extents, next, cache_extent_free);
>> +    g_free(cluster);
>> +}
>> +
>> +static bool cache_cluster_is_full(CacheCluster *cluster)
>> +{
>> +    assert(cluster->cluster_offset + cluster->n_bytes <= cluster->data_end);
>> +    return cluster->cluster_offset + cluster->n_bytes == cluster->data_end;
>> +}
>> +
>> +static void cache_cluster_remove(Qcow2CompressedWriteCache *s,
>> +                                 CacheCluster *cluster)
>> +{
>> +    if (cluster->n_bytes) {
>> +        s->nb_active_clusters--;
>> +    }
>> +    QTAILQ_REMOVE(&s->clusters, cluster, next);
>> +    cache_cluster_free(cluster);
>> +}
>> +
>> +/*
>> + * Return number of consequtive clusters starting from @first. Store next after
> 
> %s/consequi\?tive/consecutive/
> 
>> + * last extent pointer into @next, store end offset of last extent into
>> + * @end_off.
>> + */
>> +static int count_consequitive_extents(CacheExtent *first,
>> +                                      CacheExtent **next,
>> +                                      int64_t *end_off)
>> +{
>> +    CacheExtent *e;
>> +    int64_t off = first->offset;
>> +    int nb = 0;
>> +
>> +    for (e = first; e; e = QLIST_NEXT(e, next)) {
>> +        assert(e->offset >= off);
>> +        if (e->offset > off) {
>> +            break;
>> +        }
>> +        off += e->bytes;
>> +        nb++;
>> +    }
>> +
>> +    if (next) {
>> +        *next = e;
>> +    }
>> +
>> +    if (end_off) {
>> +        *end_off = off;
>> +    }
>> +
>> +    return nb;
>> +}
>> +
>> +/*
>> + * Write consequtive extents, starting from @firest. Store next after last
> 
> *first
> 
>> + * extent pointer into @next. If align > 1, align end of the whole write by
>> + * zero.
>> + */
>> +static int coroutine_fn flush_consequitive_extents(Qcow2CompressedWriteCache *s,
>> +                                                   CacheExtent *first,
>> +                                                   CacheExtent **next,
>> +                                                   int64_t align)
>> +{
>> +    CacheExtent *e, *end_extent;
>> +    int64_t end;
>> +    int nb_extents = count_consequitive_extents(first, &end_extent, &end);
> 
> I’d prefer nb_bufs, because it will include a tail if present.
> 
>> +    int64_t aligned_end = QEMU_ALIGN_UP(end, align);
>> +    int64_t tail = aligned_end - end;
>> +    int64_t len = aligned_end - first->offset;
>> +
>> +    /*
>> +     * Alignment if for flushing full cluster, first extent offset is always
> 
> s/.*,/The alignment will not exceed the cluster size, so the/?

s/if/is/. When a cluster is "full", which means it is filled up to its data_end, the remaining tail is unused, and we can align it up.

> 
>> +     * aligned.
>> +     */
>> +    assert(QEMU_IS_ALIGNED(first->offset, align));
>> +
>> +    if (next) {
>> +        *next = end_extent;
>> +    }
>> +
>> +    if (tail) {
>> +        nb_extents++;
>> +    }
>> +
>> +    if (nb_extents > IOV_MAX) {
>> +        g_autofree void *buf = g_malloc(len);
> 
> I wonder if this should be blk_blockalign() if first->offset is aligned to the cluster size.  (Don’t know how common that case is.)

blockalign may be better, yes

> 
>> +        char *p = buf;
>> +
>> +        for (e = first; e != end_extent; e = QLIST_NEXT(e, next)) {
>> +            memcpy(p, e->buf, e->bytes);
>> +            p += e->bytes;
>> +        }
>> +
>> +        if (tail) {
>> +            memset(p, 0, tail);
>> +        }
>> +
>> +        BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
>> +        return bdrv_co_pwrite(s->data_file, first->offset, len, buf, 0);
>> +    } else {
>> +        int ret;
>> +        QEMUIOVector qiov;
>> +        g_autofree void *tail_buf = NULL;
>> +
>> +        qemu_iovec_init(&qiov, nb_extents);
>> +        for (e = first; e != end_extent; e = QLIST_NEXT(e, next)) {
>> +            qemu_iovec_add(&qiov, e->buf, e->bytes);
>> +        }
>> +
>> +        if (tail) {
>> +            tail_buf = g_malloc0(tail);
>> +            qemu_iovec_add(&qiov, tail_buf, tail);
>> +        }
>> +
>> +        BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
>> +        ret = bdrv_co_pwritev(s->data_file, first->offset, len, &qiov, 0);
>> +        qemu_iovec_destroy(&qiov);
>> +        return ret;
>> +    }
> 
> The write functions are missing overlap checks.  It can be argued that this is done by qcow2_co_pwritev_compressed_task() already, but the idea was to pair all actual writes with overlap checks (i.e., cached writes in qcow2_co_pwritev_compressed_task() wouldn’t need to do a check).

Some more checks will not hurt anyway
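
For example, right before the actual writes (assuming the existing
qcow2_pre_write_overlap_check() helper; since the cache only keeps the
data_file child, either bs would have to be passed in or the check stays at
the qcow2.c call sites):

    ret = qcow2_pre_write_overlap_check(bs, 0, first->offset, len, true);
    if (ret < 0) {
        return ret;
    }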

> 
>> +}
>> +
>> +static int coroutine_fn cache_cluster_flush_full(Qcow2CompressedWriteCache *s,
>> +                                                 CacheCluster *cluster)
>> +{
>> +    int ret;
>> +    CacheExtent *end_extent;
>> +    int64_t align = MIN(s->cluster_size,
>> +                        MAX(s->data_file->bs->bl.request_alignment, 4 * 1024));
>> +
>> +    assert(cache_cluster_is_full(cluster));
>> +
>> +    ret = flush_consequitive_extents(s, QLIST_FIRST(&cluster->extents),
>> +                                     &end_extent, align);
>> +
>> +    assert(end_extent == NULL); /* all extents flushed */
>> +
>> +    return ret;
>> +}
>> +
>> +static int coroutine_fn cache_cluster_flush(Qcow2CompressedWriteCache *s,
>> +                                            CacheCluster *c)
>> +{
>> +    int ret;
>> +    CacheExtent *e = QLIST_FIRST(&c->extents);
>> +
>> +    if (cache_cluster_is_full(c)) {
>> +        return cache_cluster_flush_full(s, c);
>> +    }
>> +
>> +    while (e) {
>> +        ret = flush_consequitive_extents(s, e, &e, 1);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int coroutine_fn qcow2_compressed_cache_co_flush(Qcow2CompressedWriteCache *s)
>> +{
>> +    int ret = 0;
>> +    CacheCluster *c;
>> +    GList *local_clusters = NULL, *p;
>> +
>> +    /*
>> +     * Make a snapshot of current state: we will not flush clusters created in
>> +     * parallel with flush operations and don't allow adding more extents to
>> +     * staged clusters. We are also protected from parallel flush operations
>> +     * flushing the same clusters.
>> +     */
>> +    QTAILQ_FOREACH(c, &s->clusters, next) {
>> +        if (!c->in_flight && c->n_bytes) {
>> +            c->in_flight = true;
>> +            local_clusters = g_list_append(local_clusters, c);
>> +        }
>> +    }
>> +
>> +    for (p = local_clusters; p; p = p->next) {
>> +        CacheCluster *c = p->data;
>> +
>> +        if (ret == 0) {
>> +            ret = cache_cluster_flush(s, c);
>> +        }
>> +
>> +        c->in_flight = false;
>> +        qemu_co_queue_restart_all(&c->waiters);
>> +
>> +        if (ret == 0) {
>> +            cache_cluster_remove(s, c);
>> +        }
>> +    }
>> +
>> +    g_list_free(local_clusters);
>> +
>> +    return ret;
>> +}
>> +
>> +int coroutine_fn
>> +qcow2_compressed_cache_co_stop_flush(Qcow2CompressedWriteCache *s)
>> +{
>> +    int ret, save;
>> +
>> +    save = s->max_active_clusters;
>> +    s->max_active_clusters = 0; /* No more extents */
> 
> Perhaps better “Don’t cache any more extents” / “Don’t cache any more compressed clusters”?

Right. Or just "clusters", as a compressed cluster actually maps to an extent.

> 
>> +
>> +    ret = qcow2_compressed_cache_co_flush(s);
>> +    if (ret < 0) {
>> +        s->max_active_clusters = save;
>> +        return ret;
>> +    }
>> +
>> +    assert(QTAILQ_EMPTY(&s->clusters));
>> +    return 0;
>> +}
>> +
>> +/* @cluster takes ownership of @extent */
>> +static void cluster_add_extent(Qcow2CompressedWriteCache *s,
>> +                               CacheCluster *cluster, CacheExtent *extent)
>> +{
>> +    CacheExtent *e;
>> +
>> +    assert(extent->bytes);
>> +    assert(extent->offset >= cluster->cluster_offset);
>> +    assert(extent->offset + extent->bytes <= cluster->data_end);
>> +    assert(!cluster->in_flight);
>> +
>> +    e = QLIST_FIRST(&cluster->extents);
>> +    if (!e) {
>> +        /* inactive cluster */
>> +        assert(!cluster->n_bytes);
>> +        s->nb_active_clusters++;
>> +        assert(s->nb_active_clusters <= s->max_active_clusters);
>> +        QLIST_INSERT_HEAD(&cluster->extents, extent, next);
>> +    } else if (e->offset > extent->offset) {
>> +        assert(extent->offset + extent->bytes <= e->offset);
>> +        QLIST_INSERT_HEAD(&cluster->extents, extent, next);
>> +    } else {
>> +        while (QLIST_NEXT(e, next) &&
>> +               QLIST_NEXT(e, next)->offset < extent->offset) {
>> +            e = QLIST_NEXT(e, next);
>> +        }
>> +
>> +        /* Now e is last element with offset < extent->offset */
>> +        assert(e->offset + e->bytes <= extent->offset);
>> +
>> +        QLIST_INSERT_AFTER(e, extent, next);
>> +
>> +        e = QLIST_NEXT(extent, next);
>> +        if (e) {
>> +            assert(extent->offset + extent->bytes <= e->offset);
>> +        }
>> +    }
> 
> I like how in C one always tends to implement everything from scratch. I think adding an element into an already-sorted list is a standard problem, so it’s a shame that it’s implemented from scratch here.  Also that it doesn’t use binary search, which would require a random-access list, so...

Yes, we have g_list_insert_sorted(), but it doesn't use binary search anyway. So I decided to use typed lists, as Kevin said they are better. Still, I'm not a fan of all these macros.

> 
> Well.  Not sure if it’s necessary.  Probably not.  Although I find it weird to start the search from the front when I’d imagine it’s more likely that new compressed clusters (extents) are added to the back.

Thought about this too, but decided not to care.

1. I think none of this really matters, as an I/O write operation takes a lot longer than all these operations on a list.

2. Still, it's interesting :)

When adding an extent we want to check that it doesn't intersect with other extents. And we want to have a sorted list when flushing the cluster.

In GLib there is GTree. It has g_tree_foreach() in sorted order. But it doesn't have an option to get the neighbours of an inserted element (to check intersection), nor does it have something like lower_bound and upper_bound :\

Hmm, implementing a tree from scratch is not what I want to do in the context of my cache :\  Any ideas?

Wow, I found this: https://gitlab.gnome.org/GNOME/glib/-/merge_requests/1509

So, three months ago everything we need was merged into GLib master... Aha, and it's in GLib 2.67.0... and CentOS 7 has 2.56...

I can add a comment: "refactor using the GTree node-based API when it becomes available (it will be in GLib 2.67)"
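
With that API the intersection check in cluster_add_extent() could become
roughly this (very rough sketch; extents_tree would be a new GTree keyed by
extent offset with an int64 comparator, using the node-based API mentioned
above):

    GTreeNode *n = g_tree_insert_node(cluster->extents_tree,
                                      &extent->offset, extent);
    GTreeNode *prev = g_tree_node_previous(n);
    GTreeNode *next = g_tree_node_next(n);

    if (prev) {
        CacheExtent *p = g_tree_node_value(prev);
        assert(p->offset + p->bytes <= extent->offset);
    }
    if (next) {
        CacheExtent *nx = g_tree_node_value(next);
        assert(extent->offset + extent->bytes <= nx->offset);
    }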

> 
>> +
>> +    cluster->n_bytes += extent->bytes;
>> +}
>> +
>> +static CacheCluster *find_cluster_to_flush(Qcow2CompressedWriteCache *s)
>> +{
>> +    CacheCluster *c;
>> +
>> +    QTAILQ_FOREACH(c, &s->clusters, next) {
>> +        if (!c->in_flight && cache_cluster_is_full(c)) {
>> +            return c;
> 
> I don’t like it very much that this cache is built as an R/W cache, i.e. that entries are retained until they need to be flushed because a new entry needs space.
> 
> It was my impression this was meant as a write cache, that should speed up specifically O_DIRECT operation.  To me, that implies that entries are flushed once they are done, which is precisely why it works for compressed clusters, because we know when those are done.

We learn that a cluster is finished in two cases:

1. when data_end is set and we see that the cluster is full. That's a bad place for flushing, as we are under the qcow2 mutex.
2. when we add an extent. That occurs during some write operation... And I don't see a real difference between flushing the cluster here or when we want to create one more cluster.

And because we don't want to flush at [1], we'll probably have some full clusters left unflushed if we flush only at [2]. So we should flush anyway when we want to create a new cluster but the cache is full. And then there is no real reason to flush at [2].

> 
> I suppose nobody cares in practice because compression is basically only used for writing out whole images (so displacing clusters happens all the time, and once you’re done, the image is flushed and closed), but it just looks strange to me.
> 
>> +        }
>> +    }
>> +
>> +    return NULL;
>> +}
>> +
>> +/* Cache an extent if there is a place */
>> +static bool coroutine_fn
>> +try_cache_extent(Qcow2CompressedWriteCache *s, CacheExtent *extent,
>> +                 bool *cluster_in_flight)
> 
> Like with other functions, but here especially, I would have liked some documentation on the interface.

will add

>  I suppose the return value reflects whether the “try” worked (so a good guess is possible to me, but it’s still a guess at this point).
> 
> I have no idea what *cluster_in_flight means.

Neither do I at this point, I don't remember :) But everything is so obvious when you are writing the code...

OK, looking further ahead: *cluster_in_flight means that the cluster is currently being flushed, so please don't even retry caching this extent!

> I would have guessed it means whether the host cluster touched by the compressed cluster is in flight, but it’s also set when the cache is disabled.  The caller seems to interpret it as “write this immediately”, which to me actually seems wrong, but more on that in said caller function below.
> 
> Perhaps this function should just have an enum return value that tells the caller precisely what to do with some expressively named values.
> 
>> +{
>> +    CacheCluster *c;
>> +    int64_t cluster_offset = QEMU_ALIGN_DOWN(extent->offset, s->cluster_size);
>> +
>> +    assert(extent->bytes);
>> +
>> +    if (s->max_active_clusters == 0) {
>> +        *cluster_in_flight = true;
>> +        return false;
>> +    }
>> +
>> +    *cluster_in_flight = false;
>> +
>> +    c = find_cluster(s, cluster_offset);
>> +    if (c && c->in_flight) {
>> +        *cluster_in_flight = true;
>> +        return false;
>> +    }
>> +    if (s->nb_active_clusters >= s->max_active_clusters &&
>> +        (!c || !c->n_bytes))
>> +    {
>> +        /*
>> +         * Cache is full, we can't allocate a new cluster and can't activate
>> +         * existing inactive cluster
>> +         */
>> +        return false;
>> +    }
>> +
>> +    if (!c) {
>> +        c = cache_cluster_new(s, cluster_offset);
>> +    }
>> +
>> +    cluster_add_extent(s, c, extent);
>> +
>> +    if (cache_cluster_is_full(c)) {
>> +        qemu_co_queue_restart_all(&s->waiters);
> 
> This interface is unclear.  The documentation of s->waiters says to wake them up once there is a free space in the cache, but that isn’t the case here.  It’s true that it’s easy to make a free space (by flushing the full cluster), but there is no free space.

The problem is in the documentation, I will fix it.

> 
> This also ties in to my uneasiness about how full clusters aren’t flushed immediately.
> 
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +/* Takes ownership of @buf, don't free it after call! */
>> +int coroutine_fn
>> +qcow2_compressed_cache_co_write(Qcow2CompressedWriteCache *s, int64_t offset,
>> +                                int64_t bytes, void *buf)
>> +{
>> +    int ret;
>> +    int i;
>> +    CacheExtent *extents[] = {NULL, NULL};
>> +    int nb_extents = 0; /* number of non-NULL elements in @extents */
>> +    int64_t e0_len;
>> +
>> +    assert(bytes > 0);
>> +    assert(bytes < s->cluster_size);
>> +
>> +    e0_len = MIN(bytes, QEMU_ALIGN_UP(offset + 1, s->cluster_size) - offset);
>> +    extents[0] = cache_extent_new(offset, e0_len, buf);
>> +    nb_extents = 1;
>> +
>> +    if (bytes > e0_len) {
>> +        int64_t e1_len = bytes - e0_len;
>> +        /*
>> +         * We have to allocate a new buffer, so that clusters are independent
>> +         * and can free their extents when needed.
>> +         */
>> +        void *e1_buf = g_memdup(((const char *)buf) + e0_len, e1_len);
>> +
>> +        extents[1] = cache_extent_new(offset + e0_len, e1_len, e1_buf);
>> +        nb_extents = 2;
>> +    }
>> +
>> +    while (nb_extents) {
>> +        bool do_write = false;
>> +
>> +        for (i = 0; i < 2; i++) {
>> +            CacheExtent *e = extents[i];
>> +
>> +            do_write = false;
>> +
>> +            if (!e) {
>> +                continue;
>> +            }
>> +
>> +            if (try_cache_extent(s, e, &do_write)) {
>> +                extents[i] = NULL;
>> +                nb_extents--;
>> +                continue;
>> +            }
>> +
>> +            if (do_write) {
>> +                ret = bdrv_co_pwrite(s->data_file, e->offset, e->bytes,
>> +                                     e->buf, 0);
> 
> Is this safe?  do_write is set if the host cluster touched by this extent is in flight.  Because it can’t be full, it must currently be in the process of being flushed by cache_cluster_flush(), which flushes all consecutive areas with flush_consequitive_extents().  If the underlying file has some request alignment (which it tends to do with O_DIRECT), then those writes are likely transformed into RMW operations spanning more than their precise byte range.  Isn’t it possible that one of those RMWs then intersects with this write here?

Hmm. I thought that if the user writes to an in-flight cluster, it's wrong anyway, so we should not care... But it may be a not-yet-full in-flight cluster because of a previous "flush" operation... And in this case we should wait until it is flushed.

> 
> If the host cluster is in flight, shouldn’t we just wait until it’s flushed?

agree
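
Something like this at the start of try_cache_extent(), perhaps (rough,
untested sketch reusing the per-cluster waiters queue from the patch):

    c = find_cluster(s, cluster_offset);
    while (c && c->in_flight) {
        qemu_co_queue_wait(&c->waiters, NULL);
        /* the cluster may have been removed meanwhile, look it up again */
        c = find_cluster(s, cluster_offset);
    }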

> 
>> +
>> +                cache_extent_free(e);
>> +                extents[i] = NULL;
>> +                nb_extents--;
>> +
>> +                if (ret < 0) {
>> +                    goto out;
>> +                }
>> +            }
>> +        }
>> +
>> +        if (do_write) {
>> +            /*
>> +             * We yielded during second extent writing. Probably the cache is
>> +             * already free and we can now cache the first extent.
> 
> Where do we yield?  Do you mean it probably takes time to do the write, so it’s possible that if do_write is true because some cluster was in the process of being flushed, it is now removed from the cache, so we have a free entry?
> 
> Why not just check for whether we have a free entry?

and we do it, by moving to try_cache_extent() at the start of the loop...

> 
> Also, it’s possible that do_write is true just because the cache is disabled, which I guess is handled fine because then both extents will have been written to disk, so that nb_extents is now 0...  But that behavior isn’t documented here.
> 
>> +             */
>> +            continue;
>> +        }
>> +
>> +        if (nb_extents) {
>> +            CacheCluster *cluster = find_cluster_to_flush(s);
>> +
>> +            if (cluster) {
>> +                cluster->in_flight = true;
>> +                ret = cache_cluster_flush_full(s, cluster);
>> +                cluster->in_flight = false;
>> +                qemu_co_queue_restart_all(&cluster->waiters);
>> +                qemu_co_queue_restart_all(&s->waiters);
> 
> Looks weird to wake up other waiters here, because we are very likely going to take that spot, so it’s likely that nobody will be able to make use of it after we took another iteration.
> 
> It’s also weird because it’s different than the two other places which wake up s->waiters not if there’s a free space, but if a free space can be made by flushing find_cluster_to_flush().  As in, this is the only place where the queue is used as it is documented.

Sorry about that. I'll come back with better documentation in v2.

> 
> I think this again ties in with the question whether we shouldn’t flush clusters as soon as they are full instead of waiting until we need another free entry.

The only reason is what I've said above about the qcow2 mutex... We should not do extra work when the data_end of a cluster is set. And that's why we'll have postponed clusters anyway.

> 
>> +                if (ret < 0) {
>> +                    goto out;
>> +                }
>> +                cache_cluster_remove(s, cluster);
>> +                continue;
>> +            }
>> +
>> +            qemu_co_queue_wait(&s->waiters, NULL);
>> +        }
>> +    }
>> +
>> +    ret = 0;
>> +
>> +out:
>> +    for (i = 0; i < 2; i++) {
>> +        cache_extent_free(extents[i]);
>> +    }
>> +
>> +    return 0;
> 
> return ret, I think.
> 
>> +}
>> +
>> +int coroutine_fn
>> +qcow2_compressed_cache_co_read(Qcow2CompressedWriteCache *s, int64_t offset,
>> +                               int64_t bytes, void *buf)
>> +{
>> +    CacheCluster *c;
>> +    CacheExtent *e;
>> +    int64_t cluster_offset = QEMU_ALIGN_DOWN(offset, s->cluster_size);
>> +
>> +    c = find_cluster(s, cluster_offset);
>> +    if (c) {
>> +        QLIST_FOREACH(e, &c->extents, next) {
>> +            if (e->offset == offset && e->bytes <= bytes) {
>> +                memcpy(buf, e->buf, e->bytes);
>> +                return 0;
>> +            }
> 
> Again, reads like it should be a binary search.
> 
> I suppose e->bytes < bytes is OK because the only caller of this function is qcow2_co_preadv_compressed(), which passes a maximum length, effectively.  But then I’d call the parameter accordingly (i.e., max_bytes) and zero out the tail.

OK
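
So the lookup would become something like this (sketch; max_bytes is just the
renamed parameter):

    if (e->offset == offset && e->bytes <= max_bytes) {
        memcpy(buf, e->buf, e->bytes);
        memset((char *)buf + e->bytes, 0, max_bytes - e->bytes);
        return 0;
    }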

> 
>> +        }
>> +    }
>> +
>> +    return bdrv_co_pread(s->data_file, offset, bytes, buf, 0);
>> +}
>> +
>> +/*
>> + * Caller states, that there would be no writes to this cluster beyond
>> + * specified @cluster_data_end. So, it's OK to flush cluster when it is filled
>> + * up to @cluster_data_end and it's OK to align flushing write operation up to
>> + * some alignment (not greater than cluster_size of course).
>> + */
>> +void coroutine_fn
>> +qcow2_compressed_cache_co_set_cluster_end(Qcow2CompressedWriteCache *s,
>> +                                          int64_t cluster_data_end)
>> +{
>> +    int64_t cluster_offset = QEMU_ALIGN_DOWN(cluster_data_end, s->cluster_size);
> 
> If cluster_data_end is aligned to a cluster boundary (I don’t think the caller does that, but...) this will do nothing, so cluster_offset is likely to be one cluster too high.

Well, I'll add an assertion.
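
I.e. something like this (sketch):

    /* the caller must pass an offset strictly inside the cluster */
    assert(!QEMU_IS_ALIGNED(cluster_data_end, s->cluster_size));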

> 
> I wonder whether we should assert that cluster_data_end is not aligned to the cluster size, or whether we should use cluster_data_end - 1 here, or whether the caller should pass cluster_offset.
> 
>> +    CacheExtent *e;
>> +    CacheCluster *c;
>> +
>> +    c = find_cluster(s, cluster_offset);
>> +    if (!c) {
>> +        c = cache_cluster_new(s, cluster_offset);
> 
> Is this case reachable?  How?

Oops... I feel stupid. It's a great thing that you asked this question!

Somehow I thought that already allocated clusters may be in progress of... what? Compression? Writing?.. But actually allocation is of course done after compression. So we may compress clusters in parallel, but then they are allocated and written sequentially. Of course, without the cache the writes themselves may go in parallel. But a write to the cache is immediate and doesn't yield. We can even do it before unlocking the qcow2 mutex, to be sure that we don't switch to another coroutine that writes the next extent first.

It seems I was wrong, and all this can be a lot simpler. Sorry for taking your time, but it helped me open my eyes :\


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] block/qcow2: use compressed write cache
  2021-02-10 17:11   ` Max Reitz
@ 2021-02-11 12:53     ` Vladimir Sementsov-Ogievskiy
  2021-02-18 16:02       ` Max Reitz
  0 siblings, 1 reply; 32+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-11 12:53 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, eblake, armbru, kwolf, den

10.02.2021 20:11, Max Reitz wrote:
> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>> Introduce a new option: compressed-cache-size, with default to 64
>> clusters (to be not less than 64 default max-workers for backup job).
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   qapi/block-core.json   |  8 +++-
>>   block/qcow2.h          |  4 ++
>>   block/qcow2-refcount.c | 13 +++++++
>>   block/qcow2.c          | 87 ++++++++++++++++++++++++++++++++++++++++--
>>   4 files changed, 108 insertions(+), 4 deletions(-)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 9f555d5c1d..e0be6657f3 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -3202,6 +3202,11 @@
>>   #             an image, the data file name is loaded from the image
>>   #             file. (since 4.0)
>>   #
>> +# @compressed-cache-size: The maximum size of compressed write cache in
>> +#                         bytes. If positive must be not less than
>> +#                         cluster size. 0 disables the feature. Default
>> +#                         is 64 * cluster_size. (since 6.0)
> 
> Do we need this, really?  If you don’t use compression, the cache won’t use any memory, right?  Do you plan on using this option?
> 
> I’d just set it to a sane default.

OK for me

> 
> OTOH, “a sane default” poses two questions, namely whether 64 * cluster_size is reasonable – with subclusters, the cluster size may be rather high, so 64 * cluster_size may well be like 128 MB.  Are 64 clusters really necessary for a reasonable performance?
> 
> Second, I think I could live with a rather high default if clusters are flushed as soon as they are full.  OTOH, as I briefly touched on, in practice, I suppose compressed images are just written to constantly, so even if clusters are flushed as soon as they are full, the cache will still remain full all the time.
> 
> 
> Different topic: Why is the cache disableable?  I thought there are no downsides?
> 

To compare performance, for example.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] block/qcow2: introduce cache for compressed writes
  2021-02-11 12:49     ` Vladimir Sementsov-Ogievskiy
@ 2021-02-18 15:04       ` Max Reitz
  0 siblings, 0 replies; 32+ messages in thread
From: Max Reitz @ 2021-02-18 15:04 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: kwolf, den, qemu-devel, armbru

On 11.02.21 13:49, Vladimir Sementsov-Ogievskiy wrote:
> you may jump first to my last inline answer
> 
> 10.02.2021 20:07, Max Reitz wrote:
>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>> Compressed writes and O_DIRECT are not friends: they works too slow,
>>> because compressed writes does many small unaligned to 512 writes.
>>>
>>> Let's introduce an internal cache, so that compressed writes may work
>>> well when O_DIRECT is on.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   block/qcow2.h                        |  29 +
>>>   block/qcow2-compressed-write-cache.c | 770 +++++++++++++++++++++++++++
>>>   block/meson.build                    |   1 +
>>>   3 files changed, 800 insertions(+)
>>>   create mode 100644 block/qcow2-compressed-write-cache.c
>>>
>>> diff --git a/block/qcow2.h b/block/qcow2.h
>>> index 0678073b74..fbdedf89fa 100644
>>> --- a/block/qcow2.h
>>> +++ b/block/qcow2.h
>>> @@ -322,6 +322,8 @@ typedef struct Qcow2BitmapHeaderExt {
>>>       uint64_t bitmap_directory_offset;
>>>   } QEMU_PACKED Qcow2BitmapHeaderExt;
>>> +typedef struct Qcow2CompressedWriteCache Qcow2CompressedWriteCache;
>>> +
>>>   #define QCOW2_MAX_THREADS 4
>>>   typedef struct BDRVQcow2State {
>>> @@ -1010,4 +1012,31 @@ int coroutine_fn
>>>   qcow2_co_decrypt(BlockDriverState *bs, uint64_t host_offset,
>>>                    uint64_t guest_offset, void *buf, size_t len);
>>> +Qcow2CompressedWriteCache *qcow2_compressed_cache_new(BdrvChild 
>>> *data_file,
>>> +                                                      int64_t 
>>> cluster_size,
>>> +                                                      int64_t 
>>> cache_size);
>>> +void qcow2_compressed_cache_free(Qcow2CompressedWriteCache *s);
>>> +int coroutine_fn
>>> +qcow2_compressed_cache_co_read(Qcow2CompressedWriteCache *s, int64_t 
>>> offset,
>>> +                               int64_t bytes, void *buf);
>>> +int coroutine_fn
>>> +qcow2_compressed_cache_co_write(Qcow2CompressedWriteCache *s, 
>>> int64_t offset,
>>> +                                int64_t bytes, void *buf);
>>> +void coroutine_fn
>>> +qcow2_compressed_cache_co_set_cluster_end(Qcow2CompressedWriteCache *s,
>>> +                                          int64_t cluster_data_end);
>>> +int coroutine_fn
>>> +qcow2_compressed_cache_co_flush(Qcow2CompressedWriteCache *s);
>>> +int qcow2_compressed_cache_flush(BlockDriverState *bs,
>>> +                                 Qcow2CompressedWriteCache *state);
>>> +int coroutine_fn
>>> +qcow2_compressed_cache_co_stop_flush(Qcow2CompressedWriteCache *s);
>>> +int qcow2_compressed_cache_stop_flush(BlockDriverState *bs,
>>> +                                      Qcow2CompressedWriteCache *s);
>>> +void qcow2_compressed_cache_set_size(Qcow2CompressedWriteCache *s,
>>> +                                     int64_t size);
>>> +void coroutine_fn
>>> +qcow2_compressed_cache_co_discard(Qcow2CompressedWriteCache *s,
>>> +                                  int64_t cluster_offset);
>>> +
>>
>> It would be nice if these functions had their interface documented 
>> somewhere.
> 
> I tried to comment dificult things in .c... Is there a prefernce, where 
> to document
> how and what function does, in .h or in .c ?

No, but my problem was that I found even the things you probably didn’t 
consider difficult not completely obvious; i.e., I would’ve liked a full 
documentation on the function interface.  (And I don’t think there is 
such documentation in the .c file.)

>>>   #endif
>>> diff --git a/block/qcow2-compressed-write-cache.c 
>>> b/block/qcow2-compressed-write-cache.c
>>> new file mode 100644
>>> index 0000000000..7bb92cb550
>>> --- /dev/null
>>> +++ b/block/qcow2-compressed-write-cache.c
>>> @@ -0,0 +1,770 @@
>>> +/*
>>> + * Write cache for qcow2 compressed writes
>>> + *
>>> + * Copyright (c) 2021 Virtuozzo International GmbH.
>>> + *
>>> + * Permission is hereby granted, free of charge, to any person 
>>> obtaining a copy
>>> + * of this software and associated documentation files (the 
>>> "Software"), to deal
>>> + * in the Software without restriction, including without limitation 
>>> the rights
>>> + * to use, copy, modify, merge, publish, distribute, sublicense, 
>>> and/or sell
>>> + * copies of the Software, and to permit persons to whom the 
>>> Software is
>>> + * furnished to do so, subject to the following conditions:
>>> + *
>>> + * The above copyright notice and this permission notice shall be 
>>> included in
>>> + * all copies or substantial portions of the Software.
>>> + *
>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
>>> EXPRESS OR
>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
>>> MERCHANTABILITY,
>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT 
>>> SHALL
>>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES 
>>> OR OTHER
>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, 
>>> ARISING FROM,
>>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 
>>> DEALINGS IN
>>> + * THE SOFTWARE.
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +
>>> +#include "block/block_int.h"
>>> +#include "block/block-gen.h"
>>> +#include "qemu/coroutine.h"
>>> +#include "qapi/qapi-events-block-core.h"
>>> +#include "qcow2.h"
>>> +
>>> +typedef struct CacheExtent {
>>> +    int64_t offset;
>>> +    int64_t bytes;
>>> +    void *buf;
>>> +    QLIST_ENTRY(CacheExtent) next;
>>> +} CacheExtent;
>>> +
>>> +typedef struct CacheCluster {
>>
>> It isn’t immediately clear what these two structures mean by just 
>> their name, because “extent” has no meaning in the context of qcow2.
> 
> It's not important for the cache are extents compressed clusters or 
> not.. So I'd keep more generic name

OK, but then my problem was that I had no idea what it’s supposed to 
represent.  I had to read the code, which I don’t think is ideal (i.e., 
I’d like to know what the structures are so I can understand what the 
code does, not having to figure out both at the same time).

>> I understand CacheExtent to basically be a compressed cluster, and 
>> CacheCluster to be a host cluster.  Perhaps their names should reflect 
>> that.
>>
>> (OTOH, the Cache* prefix seems unnecessary to me, because these are 
>> just local structs.)

[...]

>>> +
>>> +struct Qcow2CompressedWriteCache {
>>> +    /*
>>> +     * @data_file and @cluster_size are copied from qcow2 state. Not 
>>> huge
>>> +     * duplications, seems better to avoid accessing the whole qcow2 
>>> state
>>> +     * instead.
>>
>> Why?  What if in the future the data_file child can be changed with 
>> reopen?
> 
> I hadn't considered this... Still, it's impossible now, yes? But it would
> be possible in the future, of course.
> 
>>
>> What’s the argument against accessing the whole qcow2 state?  That 
>> it’s a layering violation, that we shouldn’t access it while the rest 
>> uses it? 
> 
> To keep it simple to reuse the cache for something else.

I could give the argument of “isn’t that impossible right now?” right back.

But anyhow, that is a reason for “why”, but then it should be noted in 
the comment (i.e., not saying it’s better, but saying why it’s necessary).

>> If that’s a concern, perhaps *data_file should be an indirect pointer 
>> to the pointer in the BDRVQcow2State.
> 
> Or we should update it on reopen (like cache_size)

Yes, if we keep it duplicated here, then we’ll need to do it.

(The code didn’t note why it needs to be duplicated here, so if there 
had been no reason, it would’ve been easier not to duplicate it.)

[...]

>>> +static int coroutine_fn 
>>> flush_consequitive_extents(Qcow2CompressedWriteCache *s,
>>> +                                                   CacheExtent *first,
>>> +                                                   CacheExtent **next,
>>> +                                                   int64_t align)
>>> +{
>>> +    CacheExtent *e, *end_extent;
>>> +    int64_t end;
>>> +    int nb_extents = count_consequitive_extents(first, &end_extent, 
>>> &end);
>>
>> I’d prefer nb_bufs, because it will include a tail if present.
>>
>>> +    int64_t aligned_end = QEMU_ALIGN_UP(end, align);
>>> +    int64_t tail = aligned_end - end;
>>> +    int64_t len = aligned_end - first->offset;
>>> +
>>> +    /*
>>> +     * Alignment if for flushing full cluster, first extent offset 
>>> is always
>>
>> s/.*,/The alignment will not exceed the cluster size, so the/?
> 
> s/if/is/. When a cluster is "full", which means it is filled up to its
> data_end, the remaining tail is unused, and we can align it up.

That doesn’t really help me understand.  What does “Alignment is for 
flushing full cluster” mean?  That the alignment is equal to the cluster 
size?  I think there should still be an "if" in there somewhere.

I think the full explanation for why this assertion is here and what it 
means is “If a full cluster is flushed and so a block alignment has been 
given*, the first extent's offset will be at the start of the cluster. 
Otherwise, the alignment is 1.  Therefore, if an alignment greater than 
1 is given, at most we need to add a tail, but not a head (which is what 
this assertion shows).”, but perhaps I’m still wrong.

*I think this statement then requires a comment on what the alignment is 
even for.  I’m not sure myself, because, well, it isn’t commented on 
anywhere.  I think it’s because we want to avoid RMW cycles for 
O_DIRECT, so the alignment most of the time will fit the host device 
block size (though at least 4k, I suppose to include devices with a 
logical block size of 512, but a physical block size of 4k).  But we 
can’t necessarily do that for non-full clusters, because those might 
need a padding head to accomplish that, but that might overwrite 
existing data that we don’t want to overwrite, so to keep it simple, we 
just align=1 there.  Is that about right?
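
(For concreteness, with the expression from the patch: request_alignment =
512 and 64 kB clusters give align = MIN(65536, MAX(512, 4096)) = 4096, so a
full-cluster flush is padded with at most a 4 kB zero tail; align would only
drop below 4096 if the cluster size itself were smaller than that.)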

[...]

>>> +/* @cluster takes ownership of @extent */
>>> +static void cluster_add_extent(Qcow2CompressedWriteCache *s,
>>> +                               CacheCluster *cluster, CacheExtent 
>>> *extent)
>>> +{
>>> +    CacheExtent *e;
>>> +
>>> +    assert(extent->bytes);
>>> +    assert(extent->offset >= cluster->cluster_offset);
>>> +    assert(extent->offset + extent->bytes <= cluster->data_end);
>>> +    assert(!cluster->in_flight);
>>> +
>>> +    e = QLIST_FIRST(&cluster->extents);
>>> +    if (!e) {
>>> +        /* inactive cluster */
>>> +        assert(!cluster->n_bytes);
>>> +        s->nb_active_clusters++;
>>> +        assert(s->nb_active_clusters <= s->max_active_clusters);
>>> +        QLIST_INSERT_HEAD(&cluster->extents, extent, next);
>>> +    } else if (e->offset > extent->offset) {
>>> +        assert(extent->offset + extent->bytes <= e->offset);
>>> +        QLIST_INSERT_HEAD(&cluster->extents, extent, next);
>>> +    } else {
>>> +        while (QLIST_NEXT(e, next) &&
>>> +               QLIST_NEXT(e, next)->offset < extent->offset) {
>>> +            e = QLIST_NEXT(e, next);
>>> +        }
>>> +
>>> +        /* Now e is last element with offset < extent->offset */
>>> +        assert(e->offset + e->bytes <= extent->offset);
>>> +
>>> +        QLIST_INSERT_AFTER(e, extent, next);
>>> +
>>> +        e = QLIST_NEXT(extent, next);
>>> +        if (e) {
>>> +            assert(extent->offset + extent->bytes <= e->offset);
>>> +        }
>>> +    }
>>
>> I like how in C one always tends to implement everything from 
>> scratch. I think adding an element into an already-sorted list is a 
>> standard problem, so it’s a shame that it’s implemented from scratch 
>> here.  Also that it doesn’t use binary search, which would require a 
>> random-access list, so...
> 
> Yes, we have g_list_insert_sorted(), but it doesn't use binary search 
> anyway.

lol

(I mean, it’s impossible to use binary search on a GList, but still.)

> So I decided to use typed lists, as Kevin said they are better.
> Still, I'm not a fan of all these macros.
> 
>>
>> Well.  Not sure if it’s necessary.  Probably not.  Although I find it 
>> weird to start the search from the front when I’d imagine it’s more 
>> likely that new compressed clusters (extents) are added to the back.
> 
> Thought about this too, but decided not to care.
> 
> 1. I think none of this really matters, as an I/O write operation takes
> a lot longer than all these operations on a list.
> 
> 2. Still, it's interesting :)

Yes, basically my thinking, too.  I just couldn’t help but notice it 
seems like a standard problem that should have a pre-implemented simple 
solution, but I don’t think there’s a reason to spend too much time and 
effort here.

> When adding an extent we want to check that it doesn't intersect with
> other extents. And we want to have a sorted list when flushing the
> cluster.
> 
> In GLib there is GTree. It has g_tree_foreach() in sorted order. But it
> doesn't have an option to get the neighbours of an inserted element (to
> check intersection), nor does it have something like lower_bound and
> upper_bound :\
> 
> Hmm, implementing a tree from scratch is not what I want to do in the
> context of my cache :\  Any ideas?
> 
> Wow, I found this:
> https://gitlab.gnome.org/GNOME/glib/-/merge_requests/1509
> 
> So, three months ago everything we need was merged into GLib master...
> Aha, and it's in GLib 2.67.0... and CentOS 7 has 2.56...
> 
> I can add a comment: "refactor using the GTree node-based API when it
> becomes available (it will be in GLib 2.67)"

Why not.  Might be cool.

>>> +
>>> +    cluster->n_bytes += extent->bytes;
>>> +}
>>> +
>>> +static CacheCluster *find_cluster_to_flush(Qcow2CompressedWriteCache 
>>> *s)
>>> +{
>>> +    CacheCluster *c;
>>> +
>>> +    QTAILQ_FOREACH(c, &s->clusters, next) {
>>> +        if (!c->in_flight && cache_cluster_is_full(c)) {
>>> +            return c;
>>
>> I don’t like it very much that this cache is built as an R/W cache, 
>> i.e. that entries are retained until they need to be flushed because a 
>> new entry needs space.
>>
>> It was my impression this was meant as a write cache, that should 
>> speed up specifically O_DIRECT operation.  To me, that implies that 
>> entries are flushed once they are done, which is precisely why it 
>> works for compressed clusters, because we know when those are done.
> 
> We learn that a cluster is finished in two cases:
> 
> 1. when data_end is set and we see that the cluster is full. That's a
> bad place for flushing, as we are under the qcow2 mutex.
> 2. when we add an extent. That occurs during some write operation... And
> I don't see a real difference between flushing the cluster here or when
> we want to create one more cluster.
> 
> And because we don't want to flush at [1], we'll probably have some full
> clusters left unflushed if we flush only at [2]. So we should flush
> anyway when we want to create a new cluster but the cache is full. And
> then there is no real reason to flush at [2].

Yes, I acknowledged that for qcow2_compressed_cache_co_set_cluster_end() 
(“I get the intention here, because as documented somewhere, this 
function is called from a mutexed section and we probably don’t want to 
write data to the data_file here.”).

But I also raised the question of whether we might delay the 
set_cluster_end() or the flush until the mutex is no longer taken, so 
we’d still flush as soon as possible.

I suppose if you want a more versatile cache, delaying the flush until 
the cache is needed will make sense, so it also serves as a read cache, 
but in its advertised form for caching compressed clusters specifically 
for the O_DIRECT case, I’d prefer to flush as soon as possible.

>> I suppose nobody cares in practice because compression is basically 
>> only used for writing out whole images (so displacing clusters happens 
>> all the time, and once you’re done, the image is flushed and closed), 
>> but it just looks strange to me.
>>
>>> +        }
>>> +    }
>>> +
>>> +    return NULL;
>>> +}
>>> +
>>> +/* Cache an extent if there is a place */
>>> +static bool coroutine_fn
>>> +try_cache_extent(Qcow2CompressedWriteCache *s, CacheExtent *extent,
>>> +                 bool *cluster_in_flight)
>>
>> Like with other functions, but here especially, I would have liked 
>> some documentation on the interface.
> 
> will add
> 
>>   I suppose the return value reflects whether the “try” worked (so a 
>> good guess is possible to me, but it’s still a guess at this point).
>>
>> I have no idea what *cluster_in_flight means.
> 
> Neither do I at this point, I don't remember :) But everything is so
> obvious when you are writing the code...
> 
> OK, looking further ahead: *cluster_in_flight means that the cluster is
> currently being flushed, so please don't even retry caching this extent!

But that doesn’t apply when the cache is disabled, in which case 
*cluster_in_flight is also set.

>> I would have guessed it means whether the host cluster touched by the 
>> compressed cluster is in flight, but it’s also set when the cache is 
>> disabled.  The caller seems to interpret it as “write this 
>> immediately”, which to me actually seems wrong, but more on that in 
>> said caller function below.
>>
>> Perhaps this function should just have an enum return value that tells 
>> the caller precisely what to do with some expressively named values.

So this is still something I think might make sense.
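
Something along these lines, maybe (just a sketch, all names invented):

    typedef enum {
        TRY_CACHE_DONE,       /* extent stored in the cache, caller is done */
        TRY_CACHE_DISABLED,   /* cache disabled, write the extent directly */
        TRY_CACHE_IN_FLIGHT,  /* host cluster is being flushed, wait for it */
        TRY_CACHE_NO_SPACE,   /* cache full, flush or wait, then retry */
    } TryCacheResult;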

[...]

>>> +
>>> +                cache_extent_free(e);
>>> +                extents[i] = NULL;
>>> +                nb_extents--;
>>> +
>>> +                if (ret < 0) {
>>> +                    goto out;
>>> +                }
>>> +            }
>>> +        }
>>> +
>>> +        if (do_write) {
>>> +            /*
>>> +             * We yielded during second extent writing. Probably the 
>>> cache is
>>> +             * already free and we can now cache the first extent.
>>
>> Where do we yield?  Do you mean it probably takes time to do the 
>> write, so it’s possible that if do_write is true because some cluster 
>> was in the process of being flushed, it is now removed from the cache, 
>> so we have a free entry?
>>
>> Why not just check for whether we have a free entry?
> 
> and we do it, by moving to try_cache_extent() at the start of the loop...

Yes.  But why not check here, and only continue if there is a free entry?

(The problem with “continue” is that it’s a goto in disguise, so I don’t 
like using it just to avoid code duplication.  I think “continue” should 
only be used if you really want to do the next iteration, but in this 
case it seems like you just want to avoid a duplicated check.)

[...]

>>> +    CacheExtent *e;
>>> +    CacheCluster *c;
>>> +
>>> +    c = find_cluster(s, cluster_offset);
>>> +    if (!c) {
>>> +        c = cache_cluster_new(s, cluster_offset);
>>
>> Is this case reachable?  How?
> 
> Oops... I feel stupid. It's a great thing that you asked this question!
> 
> Somehow I thought that already allocated clusters may be in progress
> of... what? Compression? Writing?.. But actually allocation is of course
> done after compression. So we may compress clusters in parallel, but
> then they are allocated and written sequentially. Of course, without the
> cache the writes themselves may go in parallel. But a write to the cache
> is immediate and doesn't yield. We can even do it before unlocking the
> qcow2 mutex, to be sure that we don't switch to another coroutine that
> writes the next extent first.
> 
> It seems I was wrong, and all this can be a lot simpler. Sorry for
> taking your time, but it helped me open my eyes :\

Well, looking forward to v2 then. :)

Max



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] block/qcow2: use compressed write cache
  2021-02-11 12:53     ` Vladimir Sementsov-Ogievskiy
@ 2021-02-18 16:02       ` Max Reitz
  0 siblings, 0 replies; 32+ messages in thread
From: Max Reitz @ 2021-02-18 16:02 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block; +Cc: kwolf, den, qemu-devel, armbru

On 11.02.21 13:53, Vladimir Sementsov-Ogievskiy wrote:
> 10.02.2021 20:11, Max Reitz wrote:
>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>> Introduce a new option: compressed-cache-size, with default to 64
>>> clusters (to be not less than 64 default max-workers for backup job).
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   qapi/block-core.json   |  8 +++-
>>>   block/qcow2.h          |  4 ++
>>>   block/qcow2-refcount.c | 13 +++++++
>>>   block/qcow2.c          | 87 ++++++++++++++++++++++++++++++++++++++++--
>>>   4 files changed, 108 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>>> index 9f555d5c1d..e0be6657f3 100644
>>> --- a/qapi/block-core.json
>>> +++ b/qapi/block-core.json
>>> @@ -3202,6 +3202,11 @@
>>>   #             an image, the data file name is loaded from the image
>>>   #             file. (since 4.0)
>>>   #
>>> +# @compressed-cache-size: The maximum size of compressed write cache in
>>> +#                         bytes. If positive must be not less than
>>> +#                         cluster size. 0 disables the feature. Default
>>> +#                         is 64 * cluster_size. (since 6.0)
>>
>> Do we need this, really?  If you don’t use compression, the cache 
>> won’t use any memory, right?  Do you plan on using this option?
>>
>> I’d just set it to a sane default.
> 
> OK for me
> 
>>
>> OTOH, “a sane default” poses two questions, namely whether 64 * 
>> cluster_size is reasonable – with subclusters, the cluster size may be 
>> rather high, so 64 * cluster_size may well be like 128 MB.  Are 64 
>> clusters really necessary for a reasonable performance?
>>
>> Second, I think I could live with a rather high default if clusters 
>> are flushed as soon as they are full.  OTOH, as I briefly touched on, 
>> in practice, I suppose compressed images are just written to 
>> constantly, so even if clusters are flushed as soon as they are full, 
>> the cache will still remain full all the time.
>>
>>
>> Different topic: Why is the cache disableable?  I thought there are no 
>> downsides?
>>
> 
> To compare performance, for example.

Well :D
Doesn’t seem like a reason to expose it to the outside, though, I don’t 
know.

Max



^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-02-18 16:03 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-29 16:50 [PATCH 0/7] qcow2: compressed write cache Vladimir Sementsov-Ogievskiy
2021-01-29 16:50 ` [PATCH 1/7] qemu/queue: add some useful QLIST_ and QTAILQ_ macros Vladimir Sementsov-Ogievskiy
2021-02-01  8:29   ` Markus Armbruster
2021-02-01  8:34     ` Vladimir Sementsov-Ogievskiy
2021-02-10 17:07   ` Max Reitz
2021-01-29 16:50 ` [PATCH 2/7] block/qcow2: introduce cache for compressed writes Vladimir Sementsov-Ogievskiy
2021-02-10 17:07   ` Max Reitz
2021-02-11 12:49     ` Vladimir Sementsov-Ogievskiy
2021-02-18 15:04       ` Max Reitz
2021-01-29 16:50 ` [PATCH 3/7] block/qcow2: use compressed write cache Vladimir Sementsov-Ogievskiy
2021-02-10 17:11   ` Max Reitz
2021-02-11 12:53     ` Vladimir Sementsov-Ogievskiy
2021-02-18 16:02       ` Max Reitz
2021-01-29 16:50 ` [PATCH 4/7] simplebench: bench_one(): add slow_limit argument Vladimir Sementsov-Ogievskiy
2021-01-29 16:50 ` [PATCH 5/7] simplebench: bench_one(): support count=1 Vladimir Sementsov-Ogievskiy
2021-01-29 16:50 ` [PATCH 6/7] simplebench/bench-backup: add --compressed option Vladimir Sementsov-Ogievskiy
2021-01-29 16:50 ` [PATCH 7/7] simplebench/bench-backup: add target-cache argument Vladimir Sementsov-Ogievskiy
2021-01-29 17:30 ` [PATCH 0/7] qcow2: compressed write cache no-reply
2021-02-01  8:24   ` Vladimir Sementsov-Ogievskiy
2021-02-09 13:25 ` Max Reitz
2021-02-09 14:10   ` Vladimir Sementsov-Ogievskiy
2021-02-09 14:47     ` Max Reitz
2021-02-09 16:39       ` Vladimir Sementsov-Ogievskiy
2021-02-09 18:36         ` Vladimir Sementsov-Ogievskiy
2021-02-09 18:41           ` Denis V. Lunev
2021-02-09 18:51             ` Vladimir Sementsov-Ogievskiy
2021-02-10 10:00               ` Max Reitz
2021-02-10 10:10                 ` Vladimir Sementsov-Ogievskiy
2021-02-09 16:52       ` Denis V. Lunev
2021-02-10 10:00         ` Max Reitz
2021-02-10 12:35 ` Kevin Wolf
2021-02-10 14:35   ` Vladimir Sementsov-Ogievskiy
