All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v3] add write threshold reporting for block devices
@ 2014-11-28 12:31 Francesco Romani
  2014-11-28 12:31 ` [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold Francesco Romani
  0 siblings, 1 reply; 7+ messages in thread
From: Francesco Romani @ 2014-11-28 12:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, mdroth, lcapitulino, Francesco Romani, stefanha

v2 was: with RFC. Since last review round I dropped the tag
because I think now all the open points are addressed.
v1 was: "add watermark reporting for block devices", but
"watermark" is incorrectly unused. Hence the change in subject.

Changes since v2:
-----------------

addressed reviewers comments:
- use node name to find the block driver state to be checked
- report node name in the event notification
- fixed signed vs unsigned integer: use uint64 everywhere, to deal
  with integer overflows (more) gracefully.
- fixed pending style issues
- renamed and made public a few functions to make them testable
- add very basic initial unit tests

Changes since v1:
-----------------

addressed reviewers comments. Highligths:
- fixed terminology: "watermark" -> "usage threshold"
- threshold is expressed in bytes
- make the event triggers only once when threshold crossed
- configured threshold visible in 'query-block' output
- fixed bugs

Open issues:
------------

Not all node names show up in the 'query-block' output, but I'll
start a different thread to discuss this.

Followup:
---------

Patches I'll have on my queue and I'd like to post as followup
- more some unit testing.
- add support to set the threshold at device creation


Rationale and context from v1
-----------------------------

I'm one of the oVirt developers (http://www.ovirt.org);
oVirt is a virtualization management application built
around qemu/kvm, so it is nice to get in touch :)

We have begun a big scalability improvement effort, aiming to
support without problems hundreds of VMs per host, with plans
to support thousands in a not so distant future.
In doing so, we are reviewing our usage flows.

One of them is thin-provisioned storage, which is used
quite extensively, with block devices (ISCSI for example)
and COW images.
When using thin provisioning, oVirt tries hard to hide this
fact from the guest OS, and to do so watches closely
the usage of the device, and resize it when its usage exceeds
a configured threshold (the "high water mark"), in order
to avoid the guest OS to get paused for space exhausted.

To do the watching, we poll he devices using libvirt [1],
which in turn uses query-blockstats.
This is suboptimal with just one VM, but with hundereds of them,
let alone thousands, it doesn't scale and it is quite a resource
hog.

Would be great to have this watermark concept supported into qemu,
with a new event to be raised when the limit is crossed.

To track this RFE I filed https://bugs.launchpad.net/qemu/+bug/1338957

Moreover, I had the chance to take a look at the QEMU sources
and come up with this tentative patch which I'd also like
to submit.

Notes
-----

[0]: https://lists.gnu.org/archive/html/qemu-devel/2014-07/msg01348.html
[1]: http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=ebb0c19c48690f0598de954f8e0e9d4d29d48b85


Francesco Romani (1):
  block: add event when disk usage exceeds threshold

 block/Makefile.objs             |   1 +
 block/qapi.c                    |   3 +
 block/usage-threshold.c         | 124 ++++++++++++++++++++++++++++++++++++++++
 include/block/block_int.h       |   4 ++
 include/block/usage-threshold.h |  62 ++++++++++++++++++++
 qapi/block-core.json            |  48 +++++++++++++++-
 qmp-commands.hx                 |  28 +++++++++
 tests/Makefile                  |   3 +
 tests/test-usage-threshold.c    | 101 ++++++++++++++++++++++++++++++++
 9 files changed, 373 insertions(+), 1 deletion(-)
 create mode 100644 block/usage-threshold.c
 create mode 100644 include/block/usage-threshold.h
 create mode 100644 tests/test-usage-threshold.c

-- 
1.9.3

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
  2014-11-28 12:31 [Qemu-devel] [PATCH v3] add write threshold reporting for block devices Francesco Romani
@ 2014-11-28 12:31 ` Francesco Romani
  2014-12-01 21:07   ` Eric Blake
  2014-12-02 15:01   ` Stefan Hajnoczi
  0 siblings, 2 replies; 7+ messages in thread
From: Francesco Romani @ 2014-11-28 12:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, mdroth, lcapitulino, Francesco Romani, stefanha

Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
To let the guest run smoothly and be not unnecessarily paused, oVirt sets
a disk usage threshold (so called 'high water mark') based on the occupation
of the device,  and automatically extends the image once the threshold
is reached or exceeded.

In order to detect the crossing of the threshold, oVirt has no choice but
aggressively polling the QEMU monitor using the query-blockstats command.
This lead to unnecessary system load, and is made even worse under scale:
deployments with hundreds of VMs are no longer rare.

To fix this, this patch adds:
* A new monitor command to set a mark for a given block device.
* A new event to report if a block device usage exceeds the threshold.

This will allow the managing application to drop the polling
altogether and just wait for a watermark crossing event.

Signed-off-by: Francesco Romani <fromani@redhat.com>
---
 block/Makefile.objs             |   1 +
 block/qapi.c                    |   3 +
 block/usage-threshold.c         | 124 ++++++++++++++++++++++++++++++++++++++++
 include/block/block_int.h       |   4 ++
 include/block/usage-threshold.h |  62 ++++++++++++++++++++
 qapi/block-core.json            |  48 +++++++++++++++-
 qmp-commands.hx                 |  28 +++++++++
 tests/Makefile                  |   3 +
 tests/test-usage-threshold.c    | 101 ++++++++++++++++++++++++++++++++
 9 files changed, 373 insertions(+), 1 deletion(-)
 create mode 100644 block/usage-threshold.c
 create mode 100644 include/block/usage-threshold.h
 create mode 100644 tests/test-usage-threshold.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 04b0e43..43e381d 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
+block-obj-y += usage-threshold.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/qapi.c b/block/qapi.c
index a87a34a..c85abb0 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -24,6 +24,7 @@
 
 #include "block/qapi.h"
 #include "block/block_int.h"
+#include "block/usage-threshold.h"
 #include "qmp-commands.h"
 #include "qapi-visit.h"
 #include "qapi/qmp-output-visitor.h"
@@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
         info->iops_size = cfg.op_size;
     }
 
+    info->write_threshold = bdrv_usage_threshold_get(bs);
+
     return info;
 }
 
diff --git a/block/usage-threshold.c b/block/usage-threshold.c
new file mode 100644
index 0000000..9faf5e6
--- /dev/null
+++ b/block/usage-threshold.c
@@ -0,0 +1,124 @@
+/*
+ * QEMU System Emulator block usage threshold notification
+ *
+ * Copyright Red Hat, Inc. 2014
+ *
+ * Authors:
+ *  Francesco Romani <fromani@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "block/block_int.h"
+#include "block/coroutine.h"
+#include "block/usage-threshold.h"
+#include "qemu/notify.h"
+#include "qapi-event.h"
+#include "qmp-commands.h"
+
+
+uint64_t bdrv_usage_threshold_get(const BlockDriverState *bs)
+{
+    return bs->write_threshold_offset;
+}
+
+bool bdrv_usage_threshold_is_set(const BlockDriverState *bs)
+{
+    return !!(bs->write_threshold_offset > 0);
+}
+
+static void usage_threshold_disable(BlockDriverState *bs)
+{
+    if (bdrv_usage_threshold_is_set(bs)) {
+        notifier_with_return_remove(&bs->write_threshold_notifier);
+        bs->write_threshold_offset = 0;
+    }
+}
+
+uint64_t bdrv_usage_threshold_exceeded(const BlockDriverState *bs,
+                                       const BdrvTrackedRequest *req)
+{
+    if (bdrv_usage_threshold_is_set(bs)) {
+        if (req->offset > bs->write_threshold_offset) {
+            return (req->offset - bs->write_threshold_offset) + req->bytes;
+        }
+        if ((req->offset + req->bytes) > bs->write_threshold_offset) {
+            return (req->offset + req->bytes) - bs->write_threshold_offset;
+        }
+    }
+    return 0;
+}
+
+static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
+                                            void *opaque)
+{
+    BdrvTrackedRequest *req = opaque;
+    BlockDriverState *bs = req->bs;
+    uint64_t amount = 0;
+
+    amount = bdrv_usage_threshold_exceeded(bs, req);
+    if (amount > 0) {
+        qapi_event_send_block_usage_threshold(
+            bs->node_name,
+            amount,
+            bs->write_threshold_offset,
+            &error_abort);
+
+        /* autodisable to avoid to flood the monitor */
+        usage_threshold_disable(bs);
+    }
+
+    return 0; /* should always let other notifiers run */
+}
+
+static void usage_threshold_register_notifier(BlockDriverState *bs)
+{
+    bs->write_threshold_notifier.notify = before_write_notify;
+    notifier_with_return_list_add(&bs->before_write_notifiers,
+                                  &bs->write_threshold_notifier);
+}
+
+static void usage_threshold_update(BlockDriverState *bs,
+                                   int64_t threshold_bytes)
+{
+    bs->write_threshold_offset = threshold_bytes;
+}
+
+void bdrv_usage_threshold_set(BlockDriverState *bs, uint64_t threshold_bytes)
+{
+    if (bdrv_usage_threshold_is_set(bs)) {
+        if (threshold_bytes > 0) {
+            usage_threshold_update(bs, threshold_bytes);
+        } else {
+            usage_threshold_disable(bs);
+        }
+    } else {
+        if (threshold_bytes > 0) {
+            /* avoid multiple registration */
+            usage_threshold_register_notifier(bs);
+            usage_threshold_update(bs, threshold_bytes);
+        }
+        /* discard bogus disable request */
+    }
+}
+
+void qmp_block_set_threshold(const char *node_name, uint64_t threshold_bytes,
+                             Error **errp)
+{
+    BlockDriverState *bs;
+    AioContext *aio_context;
+
+    bs = bdrv_find_node(node_name);
+    if (!bs) {
+        error_set(errp, QERR_DEVICE_NOT_FOUND, node_name);
+        return;
+    }
+
+    aio_context = bdrv_get_aio_context(bs);
+    aio_context_acquire(aio_context);
+
+    bdrv_usage_threshold_set(bs, threshold_bytes);
+
+    aio_context_release(aio_context);
+}
diff --git a/include/block/block_int.h b/include/block/block_int.h
index a1c17b9..3791d9b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -409,6 +409,10 @@ struct BlockDriverState {
 
     /* The error object in use for blocking operations on backing_hd */
     Error *backing_blocker;
+
+    /* threshold limit for writes, in bytes. "High water mark". */
+    uint64_t write_threshold_offset;
+    NotifierWithReturn write_threshold_notifier;
 };
 
 int get_tmp_filename(char *filename, int size);
diff --git a/include/block/usage-threshold.h b/include/block/usage-threshold.h
new file mode 100644
index 0000000..12150d6
--- /dev/null
+++ b/include/block/usage-threshold.h
@@ -0,0 +1,62 @@
+/*
+ * QEMU System Emulator block usage threshold notification
+ *
+ * Copyright Red Hat, Inc. 2014
+ *
+ * Authors:
+ *  Francesco Romani <fromani@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+#ifndef BLOCK_USAGE_THRESHOLD_H
+#define BLOCK_USAGE_THRESHOLD_H
+
+#include <stdint.h>
+
+#include "qemu/typedefs.h"
+#include "qemu-common.h"
+
+/*
+ * bdrv_usage_threshold_set:
+ *
+ * Set the usage threshold for block devices, in bytes.
+ * Notify when a write exceeds the threshold, meaning the device
+ * is becoming full, so it can be transparently resized.
+ * To be used with thin-provisioned block devices.
+ *
+ * Use threshold_bytes == 0 to disable.
+ */
+void bdrv_usage_threshold_set(BlockDriverState *bs, uint64_t threshold_bytes);
+
+/*
+ * bdrv_usage_threshold_get
+ *
+ * Get the configured usage threshold, in bytes.
+ * Zero means no threshold configured.
+ */
+uint64_t bdrv_usage_threshold_get(const BlockDriverState *bs);
+
+/*
+ * bdrv_usage_threshold_is_set
+ *
+ * Tell if an usage threshold is set for a given BDS.
+ */
+bool bdrv_usage_threshold_is_set(const BlockDriverState *bs);
+
+/*
+ * bdrv_usage_threshold_exceeded
+ *
+ * Return the extent of a write request that exceeded the threshold,
+ * or zero if the request is below the threshold.
+ * Return zero also if the threshold was not set.
+ *
+ * NOTE: here we assume the following holds for each request this code
+ * deals with:
+ *
+ * assert((req->offset + req->bytes) <= UINT64_MAX)
+ */
+uint64_t bdrv_usage_threshold_exceeded(const BlockDriverState *bs,
+                                       const BdrvTrackedRequest *req);
+
+#endif
diff --git a/qapi/block-core.json b/qapi/block-core.json
index a14e6ab..d9d4e16 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -239,6 +239,9 @@
 #
 # @iops_size: #optional an I/O size in bytes (Since 1.7)
 #
+# @write-threshold: configured write threshold for the device.
+#                   0 if disabled. (Since 2.3)
+#
 # Since: 0.14.0
 #
 ##
@@ -253,7 +256,7 @@
             '*bps_max': 'int', '*bps_rd_max': 'int',
             '*bps_wr_max': 'int', '*iops_max': 'int',
             '*iops_rd_max': 'int', '*iops_wr_max': 'int',
-            '*iops_size': 'int' } }
+            '*iops_size': 'int', 'write-threshold': 'uint64' } }
 
 ##
 # @BlockDeviceIoStatus:
@@ -1826,3 +1829,46 @@
 ##
 { 'enum': 'PreallocMode',
   'data': [ 'off', 'metadata', 'falloc', 'full' ] }
+
+##
+# @BLOCK_USAGE_THRESHOLD
+#
+# Emitted when writes on block device reaches or exceeds the
+# configured threshold. For thin-provisioned devices, this
+# means the device should be extended to avoid pausing for
+# disk exaustion.
+#
+# @node-name: graph node name on which the threshold was exceeded.
+#
+# @amount-exceeded: amount of data which exceeded the threshold, in bytes.
+#
+# @offset-threshold: last configured threshold, in bytes.
+#
+# Since: 2.3
+##
+{ 'event': 'BLOCK_USAGE_THRESHOLD',
+  'data': { 'node-name': 'str',
+	    'amount-exceeded': 'uint64',
+	    'threshold': 'uint64' } }
+
+##
+# @block-set-threshold
+#
+# Change usage threshold for a block drive. An event will be delivered
+# if a write to this block drive crosses the configured threshold.
+# This is useful to transparently resize thin-provisioned drives without
+# the guest OS noticing.
+#
+# @node-name: graph node name on which the threshold must be set.
+#
+# @write-threshold: configured threshold for the block device, bytes.
+#                   Use 0 to disable the threshold.
+#
+# Returns: Nothing on success
+#          If @node name is not found on the block device graph,
+#          DeviceNotFound
+#
+# Since: 2.3
+##
+{ 'command': 'block-set-threshold',
+  'data': { 'node-name': 'str', 'threshold': 'uint64' } }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 718dd92..d365410 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -3854,3 +3854,31 @@ Move mouse pointer to absolute coordinates (20000, 400).
 <- { "return": {} }
 
 EQMP
+
+    {
+        .name       = "block-set-threshold",
+        .args_type  = "node-name:s,threshold:l",
+        .mhandler.cmd_new = qmp_marshal_input_block_set_threshold,
+    },
+
+SQMP
+block-set-threshold
+------------
+
+Change the write threshold for a block drive. The threshold is an offset,
+thus must be non-negative. Default is not write threshold.
+To set the threshold to zero disables it.
+
+Arguments:
+
+- "node-name": the node name in the block driver state graph (json-string)
+- "threshold": the write threshold in bytes (json-int)
+
+Example:
+
+-> { "execute": "block-set-threshold",
+  "arguments": { "node-name": "mydev",
+                 "threshold": 17179869184 } }
+<- { "return": {} }
+
+EQMP
diff --git a/tests/Makefile b/tests/Makefile
index 16f0e4c..db702ac 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -64,6 +64,8 @@ gcov-files-check-qom-interface-y = qom/object.c
 check-unit-$(CONFIG_POSIX) += tests/test-vmstate$(EXESUF)
 check-unit-y += tests/test-qemu-opts$(EXESUF)
 gcov-files-test-qemu-opts-y = qom/test-qemu-opts.c
+check-unit-y += tests/test-usage-threshold$(EXESUF)
+gcov-files-test-usage-threshold-y = block/usage-threshold.c
 
 check-block-$(CONFIG_POSIX) += tests/qemu-iotests-quick.sh
 
@@ -351,6 +353,7 @@ tests/usb-hcd-xhci-test$(EXESUF): tests/usb-hcd-xhci-test.o $(libqos-usb-obj-y)
 tests/vhost-user-test$(EXESUF): tests/vhost-user-test.o qemu-char.o qemu-timer.o $(qtest-obj-y)
 tests/qemu-iotests/socket_scm_helper$(EXESUF): tests/qemu-iotests/socket_scm_helper.o
 tests/test-qemu-opts$(EXESUF): tests/test-qemu-opts.o libqemuutil.a libqemustub.a
+tests/test-usage-threshold$(EXESUF): tests/test-usage-threshold.o $(block-obj-y) libqemuutil.a libqemustub.a
 
 ifeq ($(CONFIG_POSIX),y)
 LIBS += -lutil
diff --git a/tests/test-usage-threshold.c b/tests/test-usage-threshold.c
new file mode 100644
index 0000000..812c724
--- /dev/null
+++ b/tests/test-usage-threshold.c
@@ -0,0 +1,101 @@
+/*
+ * Test block device usage threshold
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ *
+ */
+
+#include <glib.h>
+#include <stdint.h>
+#include "block/block_int.h"
+#include "block/usage-threshold.h"
+
+
+static void test_threshold_not_set_on_init(void)
+{
+    uint64_t res;
+    BlockDriverState bs;
+    memset(&bs, 0, sizeof(bs));
+
+    g_assert_false(bdrv_usage_threshold_is_set(&bs));
+
+    res = bdrv_usage_threshold_get(&bs);
+    g_assert_cmpint(res, ==, 0);
+}
+
+static void test_threshold_set_get(void)
+{
+    uint64_t threshold = 4 * 1024 * 1024;
+    uint64_t res;
+    BlockDriverState bs;
+    memset(&bs, 0, sizeof(bs));
+
+    bdrv_usage_threshold_set(&bs, threshold);
+
+    g_assert(bdrv_usage_threshold_is_set(&bs));
+
+    res = bdrv_usage_threshold_get(&bs);
+    g_assert_cmpint(res, ==, threshold);
+}
+
+static void test_threshold_multi_set_get(void)
+{
+    uint64_t threshold1 = 4 * 1024 * 1024;
+    uint64_t threshold2 = 15 * 1024 * 1024;
+    uint64_t res;
+    BlockDriverState bs;
+    memset(&bs, 0, sizeof(bs));
+
+    bdrv_usage_threshold_set(&bs, threshold1);
+    bdrv_usage_threshold_set(&bs, threshold2);
+    res = bdrv_usage_threshold_get(&bs);
+    g_assert_cmpint(res, ==, threshold2);
+}
+
+static void test_threshold_not_trigger(void)
+{
+    uint64_t amount = 0;
+    uint64_t threshold = 4 * 1024 * 1024;
+    BlockDriverState bs;
+    BdrvTrackedRequest req;
+
+    memset(&bs, 0, sizeof(bs));
+    memset(&req, 0, sizeof(req));
+    req.offset = 1024;
+    req.bytes = 1024;
+
+    bdrv_usage_threshold_set(&bs, threshold);
+    amount = bdrv_usage_threshold_exceeded(&bs, &req);
+    g_assert_cmpuint(amount, ==, 0);
+}
+
+
+static void test_threshold_trigger(void)
+{
+    uint64_t amount = 0;
+    uint64_t threshold = 4 * 1024 * 1024;
+    BlockDriverState bs;
+    BdrvTrackedRequest req;
+
+    memset(&bs, 0, sizeof(bs));
+    memset(&req, 0, sizeof(req));
+    req.offset = (4 * 1024 * 1024) - 1024;
+    req.bytes = 2 * 1024;
+
+    bdrv_usage_threshold_set(&bs, threshold);
+    amount = bdrv_usage_threshold_exceeded(&bs, &req);
+    g_assert_cmpuint(amount, >=, 1024);
+}
+
+
+int main(int argc, char **argv)
+{
+    g_test_init(&argc, &argv, NULL);
+    g_test_add_func("/usage-threshold/not-set-on-init", test_threshold_not_set_on_init);
+    g_test_add_func("/usage-threshold/set-get", test_threshold_set_get);
+    g_test_add_func("/usage-threshold/multi-set-get", test_threshold_multi_set_get);
+    g_test_add_func("/usage-threshold/not-trigger", test_threshold_not_trigger);
+    g_test_add_func("/usage-threshold/trigger", test_threshold_trigger);
+    return g_test_run();
+}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
  2014-11-28 12:31 ` [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold Francesco Romani
@ 2014-12-01 21:07   ` Eric Blake
  2014-12-02  7:47     ` Francesco Romani
  2014-12-02 15:01   ` Stefan Hajnoczi
  1 sibling, 1 reply; 7+ messages in thread
From: Eric Blake @ 2014-12-01 21:07 UTC (permalink / raw)
  To: Francesco Romani, qemu-devel; +Cc: kwolf, mdroth, stefanha, lcapitulino

[-- Attachment #1: Type: text/plain, Size: 6161 bytes --]

On 11/28/2014 05:31 AM, Francesco Romani wrote:
> Managing applications, like oVirt (http://www.ovirt.org), make extensive
> use of thin-provisioned disk images.
> To let the guest run smoothly and be not unnecessarily paused, oVirt sets
> a disk usage threshold (so called 'high water mark') based on the occupation
> of the device,  and automatically extends the image once the threshold
> is reached or exceeded.
> 
> In order to detect the crossing of the threshold, oVirt has no choice but
> aggressively polling the QEMU monitor using the query-blockstats command.
> This lead to unnecessary system load, and is made even worse under scale:
> deployments with hundreds of VMs are no longer rare.
> 
> To fix this, this patch adds:
> * A new monitor command to set a mark for a given block device.
> * A new event to report if a block device usage exceeds the threshold.
> 
> This will allow the managing application to drop the polling
> altogether and just wait for a watermark crossing event.

I like the idea!

Question - what happens if management misses the event (for example, if
libvirtd is restarted)?  Does the existing 'query-blockstats' and/or
'query-named-block-nodes' still work to query the current threshold and
whether it has been exceeded, as a poll-once command executed when
reconnecting to the monitor?

> 
> Signed-off-by: Francesco Romani <fromani@redhat.com>
> ---

No need for a 0/1 cover letter on a 1-patch series; you have the option
of just putting the side-band information here and sending it as a
single mail.  But the cover letter approach doesn't hurt either, and I
can see how it can be easier for some workflows to always send a cover
letter than to special-case a 1-patch series.

> +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
> +                                            void *opaque)
> +{
> +    BdrvTrackedRequest *req = opaque;
> +    BlockDriverState *bs = req->bs;
> +    uint64_t amount = 0;
> +
> +    amount = bdrv_usage_threshold_exceeded(bs, req);
> +    if (amount > 0) {
> +        qapi_event_send_block_usage_threshold(
> +            bs->node_name,
> +            amount,
> +            bs->write_threshold_offset,
> +            &error_abort);
> +
> +        /* autodisable to avoid to flood the monitor */

s/to flood/flooding/

> +/*
> + * bdrv_usage_threshold_is_set
> + *
> + * Tell if an usage threshold is set for a given BDS.

s/an usage/a usage/

(in English, the difference between "a" and "an" is whether the leading
sound of the next word is pronounced or not; in this case, "usage" is
pronounced with a hard "yoo-sage".  It may help to remember "an umbrella
for a unicorn")

> +++ b/qapi/block-core.json
> @@ -239,6 +239,9 @@
>  #
>  # @iops_size: #optional an I/O size in bytes (Since 1.7)
>  #
> +# @write-threshold: configured write threshold for the device.
> +#                   0 if disabled. (Since 2.3)
> +#
>  # Since: 0.14.0
>  #
>  ##
> @@ -253,7 +256,7 @@
>              '*bps_max': 'int', '*bps_rd_max': 'int',
>              '*bps_wr_max': 'int', '*iops_max': 'int',
>              '*iops_rd_max': 'int', '*iops_wr_max': 'int',
> -            '*iops_size': 'int' } }
> +            '*iops_size': 'int', 'write-threshold': 'uint64' } }

In QMP specs, 'uint64' and 'int' are practically synonyms.  I can live
with either spelling, although 'int' is more common.

Bikeshed on naming: Although we prefer '-' over '_' in new interfaces,
we also favor consistency, and BlockDeviceInfo is one of those dinosaur
commands that uses _ everywhere until your addition.  So naming this
field 'write_threshold' would be more consistent.

> +##
> +# @BLOCK_USAGE_THRESHOLD
> +#
> +# Emitted when writes on block device reaches or exceeds the
> +# configured threshold. For thin-provisioned devices, this
> +# means the device should be extended to avoid pausing for
> +# disk exaustion.

s/exaustion/exhaustion/

> +#
> +# @node-name: graph node name on which the threshold was exceeded.
> +#
> +# @amount-exceeded: amount of data which exceeded the threshold, in bytes.
> +#
> +# @offset-threshold: last configured threshold, in bytes.
> +#

Might want to mention that this event is one-shot; after it triggers, a
user must re-register a threshold to get the event again.

> +# Since: 2.3
> +##
> +{ 'event': 'BLOCK_USAGE_THRESHOLD',
> +  'data': { 'node-name': 'str',
> +	    'amount-exceeded': 'uint64',

TAB damage.  Please use spaces.  ./scripts/checkpatch.pl will catch some
offenders (although I didn't test if it will catch this one).

However, here you are correct in using '-' for naming :)

> +	    'threshold': 'uint64' } }
> +
> +##
> +# @block-set-threshold
> +#
> +# Change usage threshold for a block drive. An event will be delivered
> +# if a write to this block drive crosses the configured threshold.
> +# This is useful to transparently resize thin-provisioned drives without
> +# the guest OS noticing.
> +#
> +# @node-name: graph node name on which the threshold must be set.
> +#
> +# @write-threshold: configured threshold for the block device, bytes.
> +#                   Use 0 to disable the threshold.
> +#
> +# Returns: Nothing on success
> +#          If @node name is not found on the block device graph,
> +#          DeviceNotFound
> +#
> +# Since: 2.3
> +##
> +{ 'command': 'block-set-threshold',
> +  'data': { 'node-name': 'str', 'threshold': 'uint64' } }

Again, 'int' instead of 'uint64' is more typical, but shouldn't hurt
with either spelling.

> +SQMP
> +block-set-threshold
> +------------
> +
> +Change the write threshold for a block drive. The threshold is an offset,
> +thus must be non-negative. Default is not write threshold.

s/not/no/

> +To set the threshold to zero disables it.

s/To set/Setting/

Missing the change to the 'query-block' and 'query-named-block-nodes'
examples to show the new always-present output field.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
  2014-12-01 21:07   ` Eric Blake
@ 2014-12-02  7:47     ` Francesco Romani
  2014-12-02  8:07       ` Francesco Romani
  0 siblings, 1 reply; 7+ messages in thread
From: Francesco Romani @ 2014-12-02  7:47 UTC (permalink / raw)
  To: Eric Blake; +Cc: kwolf, lcapitulino, qemu-devel, stefanha, mdroth

Thanks for the quick review!

----- Original Message -----
> From: "Eric Blake" <eblake@redhat.com>
> To: "Francesco Romani" <fromani@redhat.com>, qemu-devel@nongnu.org
> Cc: kwolf@redhat.com, mdroth@linux.vnet.ibm.com, stefanha@redhat.com, lcapitulino@redhat.com
> Sent: Monday, December 1, 2014 10:07:38 PM
> Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds	threshold
> 
> On 11/28/2014 05:31 AM, Francesco Romani wrote:
> > Managing applications, like oVirt (http://www.ovirt.org), make extensive
> > use of thin-provisioned disk images.
> > To let the guest run smoothly and be not unnecessarily paused, oVirt sets
> > a disk usage threshold (so called 'high water mark') based on the
> > occupation
> > of the device,  and automatically extends the image once the threshold
> > is reached or exceeded.
> > 
> > In order to detect the crossing of the threshold, oVirt has no choice but
> > aggressively polling the QEMU monitor using the query-blockstats command.
> > This lead to unnecessary system load, and is made even worse under scale:
> > deployments with hundreds of VMs are no longer rare.
> > 
> > To fix this, this patch adds:
> > * A new monitor command to set a mark for a given block device.
> > * A new event to report if a block device usage exceeds the threshold.
> > 
> > This will allow the managing application to drop the polling
> > altogether and just wait for a watermark crossing event.
> 
> I like the idea!
> 
> Question - what happens if management misses the event (for example, if
> libvirtd is restarted)?  Does the existing 'query-blockstats' and/or
> 'query-named-block-nodes' still work to query the current threshold and
> whether it has been exceeded, as a poll-once command executed when
> reconnecting to the monitor?

Indeed oVirt wants to keep the existing polling and to use it as fallback,
to make sure no events are missed. oVirt will not rely on the new notification
*alone*.
The plan is to "just" poll *much less* frequently. Today's default poll
rate is every 2 (two) seconds, so there is a lot of room for improvement.

> 
> > 
> > Signed-off-by: Francesco Romani <fromani@redhat.com>
> > ---
> 
> No need for a 0/1 cover letter on a 1-patch series; you have the option
> of just putting the side-band information here and sending it as a
> single mail.  But the cover letter approach doesn't hurt either, and I
> can see how it can be easier for some workflows to always send a cover
> letter than to special-case a 1-patch series.

Also, I found that a separate context/introduction/room for comments
would helped, especially on first two revisions of the patch which were
tagged as RFC. I have zero problems in dropping the cover letter now that
consensus is forming and patch is taking shape, just let me know.

[... snip spelling: thanks! will fix]
> > +++ b/qapi/block-core.json
> > @@ -239,6 +239,9 @@
> >  #
> >  # @iops_size: #optional an I/O size in bytes (Since 1.7)
> >  #
> > +# @write-threshold: configured write threshold for the device.
> > +#                   0 if disabled. (Since 2.3)
> > +#
> >  # Since: 0.14.0
> >  #
> >  ##
> > @@ -253,7 +256,7 @@
> >              '*bps_max': 'int', '*bps_rd_max': 'int',
> >              '*bps_wr_max': 'int', '*iops_max': 'int',
> >              '*iops_rd_max': 'int', '*iops_wr_max': 'int',
> > -            '*iops_size': 'int' } }
> > +            '*iops_size': 'int', 'write-threshold': 'uint64' } }
> 
> In QMP specs, 'uint64' and 'int' are practically synonyms.  I can live
> with either spelling, although 'int' is more common.
> 
> Bikeshed on naming: Although we prefer '-' over '_' in new interfaces,
> we also favor consistency, and BlockDeviceInfo is one of those dinosaur
> commands that uses _ everywhere until your addition.  So naming this
> field 'write_threshold' would be more consistent.

Agreed. Will fix.

> > +#
> > +# @node-name: graph node name on which the threshold was exceeded.
> > +#
> > +# @amount-exceeded: amount of data which exceeded the threshold, in bytes.
> > +#
> > +# @offset-threshold: last configured threshold, in bytes.
> > +#
> 
> Might want to mention that this event is one-shot; after it triggers, a
> user must re-register a threshold to get the event again.

Good point. Will fix.

> 
> > +# Since: 2.3
> > +##
> > +{ 'event': 'BLOCK_USAGE_THRESHOLD',
> > +  'data': { 'node-name': 'str',
> > +	    'amount-exceeded': 'uint64',
> 
> TAB damage.  Please use spaces.  ./scripts/checkpatch.pl will catch some
> offenders (although I didn't test if it will catch this one).
> 
> However, here you are correct in using '-' for naming :)

Oops. Rebase error (was clean before!) will fix.

> 
> > +	    'threshold': 'uint64' } }
> > +
> > +##
> > +# @block-set-threshold
> > +#
> > +# Change usage threshold for a block drive. An event will be delivered
> > +# if a write to this block drive crosses the configured threshold.
> > +# This is useful to transparently resize thin-provisioned drives without
> > +# the guest OS noticing.
> > +#
> > +# @node-name: graph node name on which the threshold must be set.
> > +#
> > +# @write-threshold: configured threshold for the block device, bytes.
> > +#                   Use 0 to disable the threshold.
> > +#
> > +# Returns: Nothing on success
> > +#          If @node name is not found on the block device graph,
> > +#          DeviceNotFound
> > +#
> > +# Since: 2.3
> > +##
> > +{ 'command': 'block-set-threshold',
> > +  'data': { 'node-name': 'str', 'threshold': 'uint64' } }
> 
> Again, 'int' instead of 'uint64' is more typical, but shouldn't hurt
> with either spelling.
> 
> > +SQMP
> > +block-set-threshold
> > +------------
> > +
> > +Change the write threshold for a block drive. The threshold is an offset,
> > +thus must be non-negative. Default is not write threshold.
> 
> s/not/no/
> 
> > +To set the threshold to zero disables it.
> 
> s/To set/Setting/
> 
> Missing the change to the 'query-block' and 'query-named-block-nodes'
> examples to show the new always-present output field.


Sorry, not sure I'm following. Isn't that hunk enough?

diff --git a/block/qapi.c b/block/qapi.c
index a87a34a..c85abb0 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -24,6 +24,7 @@
 
 #include "block/qapi.h"
 #include "block/block_int.h"
+#include "block/usage-threshold.h"
 #include "qmp-commands.h"
 #include "qapi-visit.h"
 #include "qapi/qmp-output-visitor.h"
@@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
         info->iops_size = cfg.op_size;
     }
 
+    info->write_threshold = bdrv_usage_threshold_get(bs);
+
     return info;
 }
 

It produces the output I'd expect on my tests:

Welcome to the QMP low-level shell!
Connected to QEMU 2.1.91

(QEMU) query-block
{   u'return': [   {   u'device': u'drive-virtio-disk0',
                       u'inserted': {   u'backing_file_depth': 0,
                                        u'bps': 0,
                                        u'bps_rd': 0,
                                        u'bps_wr': 0,
                                        u'detect_zeroes': u'off',
                                        u'drv': u'qcow2',
                                        u'encrypted': False,
                                        u'encryption_key_missing': False,
                                        u'file': u'/var/lib/libvirt/images/fedora20.qcow2',
                                        u'image': {   u'actual-size': 11804672,
                                                      u'cluster-size': 65536,
                                                      u'dirty-flag': False,
                                                      u'filename': u'/var/lib/libvirt/images/fedora20.qcow2',
                                                      u'format': u'qcow2',
                                                      u'format-specific': {   u'data': {   u'compat': u'1.1',
                                                                                           u'corrupt': False,
                                                                                           u'lazy-refcounts': False},
                                                                              u'type': u'qcow2'},
                                                      u'virtual-size': 17179869184},
                                        u'iops': 0,
                                        u'iops_rd': 0,
                                        u'iops_wr': 0,
                                        u'node-name': u'foo',
                                        u'ro': False,
                                        u'write-threshold': 0},
                       u'io-status': u'ok',
                       u'locked': False,
                       u'removable': False,
                       u'type': u'unknown'},
                   {   u'device': u'drive-ide0-0-0',
                       u'inserted': {   u'backing_file_depth': 0,
                                        u'bps': 0,
                                        u'bps_rd': 0,
                                        u'bps_wr': 0,
                                        u'detect_zeroes': u'off',
                                        u'drv': u'raw',
                                        u'encrypted': False,
                                        u'encryption_key_missing': False,
                                        u'file': u'/var/lib/libvirt/images/Fedora-Live-Desktop-x86_64-20-1.iso',
                                        u'image': {   u'actual-size': 999297024,
                                                      u'dirty-flag': False,
                                                      u'filename': u'/var/lib/libvirt/images/Fedora-Live-Desktop-x86_64-20-1.iso',
                                                      u'format': u'raw',
                                                      u'virtual-size': 999292928},
                                        u'iops': 0,
                                        u'iops_rd': 0,
                                        u'iops_wr': 0,
                                        u'ro': True,
                                        u'write-threshold': 0},
                       u'io-status': u'ok',
                       u'locked': False,
                       u'removable': True,
                       u'tray_open': False,
                       u'type': u'unknown'}]}

(QEMU) query-named-block-nodes
{   u'return': [   {   u'backing_file_depth': 0,
                       u'bps': 0,
                       u'bps_rd': 0,
                       u'bps_wr': 0,
                       u'detect_zeroes': u'off',
                       u'drv': u'qcow2',
                       u'encrypted': False,
                       u'encryption_key_missing': False,
                       u'file': u'/var/lib/libvirt/images/fedora20.qcow2',
                       u'image': {   },
                       u'iops': 0,
                       u'iops_rd': 0,
                       u'iops_wr': 0,
                       u'node-name': u'foo',
                       u'ro': False,
                       u'write-threshold': 0},
                   {   u'backing_file_depth': 0,
                       u'bps': 0,
                       u'bps_rd': 0,
                       u'bps_wr': 0,
                       u'detect_zeroes': u'off',
                       u'drv': u'file',
                       u'encrypted': False,
                       u'encryption_key_missing': False,
                       u'file': u'/var/lib/libvirt/images/fedora20.qcow2',
                       u'image': {   },
                       u'iops': 0,
                       u'iops_rd': 0,
                       u'iops_wr': 0,
                       u'node-name': u'bar',
                       u'ro': False,
                       u'write-threshold': 0}]}


(relevant cmd-line snippet:
-drive file=/var/lib/libvirt/images/fedora20.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,node-name=foo,file.node-name=bar \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 \
)

Bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
  2014-12-02  7:47     ` Francesco Romani
@ 2014-12-02  8:07       ` Francesco Romani
  0 siblings, 0 replies; 7+ messages in thread
From: Francesco Romani @ 2014-12-02  8:07 UTC (permalink / raw)
  To: Eric Blake; +Cc: kwolf, mdroth, qemu-devel, stefanha, lcapitulino

----- Original Message -----
> From: "Francesco Romani" <fromani@redhat.com>
> To: "Eric Blake" <eblake@redhat.com>
> Cc: kwolf@redhat.com, lcapitulino@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com, mdroth@linux.vnet.ibm.com
> Sent: Tuesday, December 2, 2014 8:47:45 AM
> Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage	exceeds	threshold
> 
> Thanks for the quick review!

> > Missing the change to the 'query-block' and 'query-named-block-nodes'
> > examples to show the new always-present output field.
> 

Sorry, I missed the *examples* keyword. Will fix.

-- 
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
  2014-11-28 12:31 ` [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold Francesco Romani
  2014-12-01 21:07   ` Eric Blake
@ 2014-12-02 15:01   ` Stefan Hajnoczi
  2014-12-02 15:11     ` Francesco Romani
  1 sibling, 1 reply; 7+ messages in thread
From: Stefan Hajnoczi @ 2014-12-02 15:01 UTC (permalink / raw)
  To: Francesco Romani; +Cc: kwolf, mdroth, qemu-devel, lcapitulino

[-- Attachment #1: Type: text/plain, Size: 441 bytes --]

On Fri, Nov 28, 2014 at 01:31:07PM +0100, Francesco Romani wrote:
> @@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
>          info->iops_size = cfg.op_size;
>      }
>  
> +    info->write_threshold = bdrv_usage_threshold_get(bs);

Overall looks good but I notice that "write_threshold" and
"usage_threshold" are both used.  Please use just one consistently (I
think "write_threshold" is clearer).

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
  2014-12-02 15:01   ` Stefan Hajnoczi
@ 2014-12-02 15:11     ` Francesco Romani
  0 siblings, 0 replies; 7+ messages in thread
From: Francesco Romani @ 2014-12-02 15:11 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, lcapitulino, mdroth, qemu-devel



----- Original Message -----
> From: "Stefan Hajnoczi" <stefanha@redhat.com>
> To: "Francesco Romani" <fromani@redhat.com>
> Cc: kwolf@redhat.com, mdroth@linux.vnet.ibm.com, qemu-devel@nongnu.org, lcapitulino@redhat.com
> Sent: Tuesday, December 2, 2014 4:01:17 PM
> Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds	threshold
> 
> On Fri, Nov 28, 2014 at 01:31:07PM +0100, Francesco Romani wrote:
> > @@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState
> > *bs)
> >          info->iops_size = cfg.op_size;
> >      }
> >  
> > +    info->write_threshold = bdrv_usage_threshold_get(bs);
> 
> Overall looks good but I notice that "write_threshold" and
> "usage_threshold" are both used.  Please use just one consistently (I
> think "write_threshold" is clearer).

Agreed. Will use "write_threshold" everywhere, file names included.

Bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-12-02 15:12 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-28 12:31 [Qemu-devel] [PATCH v3] add write threshold reporting for block devices Francesco Romani
2014-11-28 12:31 ` [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold Francesco Romani
2014-12-01 21:07   ` Eric Blake
2014-12-02  7:47     ` Francesco Romani
2014-12-02  8:07       ` Francesco Romani
2014-12-02 15:01   ` Stefan Hajnoczi
2014-12-02 15:11     ` Francesco Romani

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.