qemu-devel.nongnu.org archive mirror
* [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism
@ 2020-12-15 12:30 Jiahui Cen
  2020-12-15 12:30 ` [PATCH v4 1/7] qapi/block-core: Add retry option for error action Jiahui Cen
                   ` (9 more replies)
  0 siblings, 10 replies; 12+ messages in thread
From: Jiahui Cen @ 2020-12-15 12:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

A VM in a cloud environment may use a virtual disk as its backend storage,
and there are usually filesystems on the virtual block device. When the
backend storage is temporarily down, any I/O issued to the virtual block
device fails. For example, an I/O error in an ext4 filesystem can make the
filesystem read-only. In a production environment, cloud backend storage can
often be recovered quickly. For example, an IP-SAN may go down due to a
network failure and come back online soon after the network recovers.
However, the error in the filesystem may not be recoverable without a device
reattach or a system restart. Thus an I/O retry mechanism is needed to
implement a self-healing system.

This patch series proposes to extend the werror=/rerror= mechanism with a
'retry' action. It automatically retries failed I/O requests without
reporting the error to the guest, so the guest can resume running smoothly
once I/O has recovered.

v3->v4:
* Adapt to werror=/rerror= mechanism.

v2->v3:
* Add a doc to describe I/O hang.

v1->v2:
* Rebase to fix compile problems.
* Fix incorrect removal of the rehandle list.
* Provide rehandle pause interface.

REF: https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg06560.html

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>

Jiahui Cen (7):
  qapi/block-core: Add retry option for error action
  block-backend: Introduce retry timer
  block-backend: Add device specific retry callback
  block-backend: Enable retry action on errors
  block-backend: Add timeout support for retry
  block: Add error retry param setting
  virtio_blk: Add support for retry on errors

 block/block-backend.c          | 66 ++++++++++++++++++++
 blockdev.c                     | 52 +++++++++++++++
 hw/block/block.c               | 10 +++
 hw/block/virtio-blk.c          | 19 +++++-
 include/hw/block/block.h       |  7 ++-
 include/sysemu/block-backend.h | 10 +++
 qapi/block-core.json           |  4 +-
 7 files changed, 162 insertions(+), 6 deletions(-)

-- 
2.28.0




* [PATCH v4 1/7] qapi/block-core: Add retry option for error action
  2020-12-15 12:30 [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
@ 2020-12-15 12:30 ` Jiahui Cen
  2021-01-27 17:16   ` Eric Blake
  2020-12-15 12:30 ` [PATCH v4 2/7] block-backend: Introduce retry timer Jiahui Cen
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 12+ messages in thread
From: Jiahui Cen @ 2020-12-15 12:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Add a new error action 'retry' to support retry on errors.
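
The string-to-action mapping this patch extends can be sketched in
isolation as follows. This is only an illustration: OnErrorAction and
parse_on_error() are hypothetical stand-ins for QEMU's generated
BlockdevOnError enum and parse_block_error_action(), and the invalid case
returns a sentinel instead of setting an Error object.

```c
#include <string.h>

/* Hypothetical stand-in for the generated BlockdevOnError enum. */
typedef enum {
    ON_ERROR_INVALID = -1,
    ON_ERROR_REPORT,
    ON_ERROR_IGNORE,
    ON_ERROR_ENOSPC,
    ON_ERROR_STOP,
    ON_ERROR_RETRY,
} OnErrorAction;

/*
 * Map a werror=/rerror= string to an action.
 * "enospc" is accepted for writes only; unknown strings are invalid.
 */
static OnErrorAction parse_on_error(const char *buf, int is_read)
{
    if (!strcmp(buf, "ignore")) {
        return ON_ERROR_IGNORE;
    } else if (!is_read && !strcmp(buf, "enospc")) {
        return ON_ERROR_ENOSPC;
    } else if (!strcmp(buf, "stop")) {
        return ON_ERROR_STOP;
    } else if (!strcmp(buf, "report")) {
        return ON_ERROR_REPORT;
    } else if (!strcmp(buf, "retry")) {
        return ON_ERROR_RETRY;  /* the newly added action */
    }
    return ON_ERROR_INVALID;
}
```

With this sketch, "retry" is accepted for both reads and writes, while
"enospc" stays write-only.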

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 blockdev.c           | 2 ++
 qapi/block-core.json | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 412354b4b6..47c0e6db52 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -342,6 +342,8 @@ static int parse_block_error_action(const char *buf, bool is_read, Error **errp)
         return BLOCKDEV_ON_ERROR_STOP;
     } else if (!strcmp(buf, "report")) {
         return BLOCKDEV_ON_ERROR_REPORT;
+    } else if (!strcmp(buf, "retry")) {
+        return BLOCKDEV_ON_ERROR_RETRY;
     } else {
         error_setg(errp, "'%s' invalid %s error action",
                    buf, is_read ? "read" : "write");
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 04c5196e59..ef5492bcdf 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1146,7 +1146,7 @@
 # Since: 1.3
 ##
 { 'enum': 'BlockdevOnError',
-  'data': ['report', 'ignore', 'enospc', 'stop', 'auto'] }
+  'data': ['report', 'ignore', 'enospc', 'stop', 'auto', 'retry'] }
 
 ##
 # @MirrorSyncMode:
@@ -4770,7 +4770,7 @@
 # Since: 2.1
 ##
 { 'enum': 'BlockErrorAction',
-  'data': [ 'ignore', 'report', 'stop' ] }
+  'data': [ 'ignore', 'report', 'stop', 'retry' ] }
 
 
 ##
-- 
2.28.0




* [PATCH v4 2/7] block-backend: Introduce retry timer
  2020-12-15 12:30 [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
  2020-12-15 12:30 ` [PATCH v4 1/7] qapi/block-core: Add retry option for error action Jiahui Cen
@ 2020-12-15 12:30 ` Jiahui Cen
  2020-12-15 12:30 ` [PATCH v4 3/7] block-backend: Add device specific retry callback Jiahui Cen
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Jiahui Cen @ 2020-12-15 12:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Add a timer to regularly trigger retry on errors.

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 block/block-backend.c | 21 ++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index ce78d30794..fe775ea298 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -35,6 +35,9 @@
 
 static AioContext *blk_aiocb_get_aio_context(BlockAIOCB *acb);
 
+/* block backend default retry interval */
+#define BLOCK_BACKEND_DEFAULT_RETRY_INTERVAL   1000
+
 typedef struct BlockBackendAioNotifier {
     void (*attached_aio_context)(AioContext *new_context, void *opaque);
     void (*detach_aio_context)(void *opaque);
@@ -95,6 +98,15 @@ struct BlockBackend {
      * Accessed with atomic ops.
      */
     unsigned int in_flight;
+
+    /* Timer for retry on errors. */
+    QEMUTimer *retry_timer;
+    /* Interval in ms to trigger next retry. */
+    int64_t retry_interval;
+    /* Start time of the first error. Used to check timeout. */
+    int64_t retry_start_time;
+    /* Retry timeout. 0 represents infinite retry. */
+    int64_t retry_timeout;
 };
 
 typedef struct BlockBackendAIOCB {
@@ -345,6 +357,11 @@ BlockBackend *blk_new(AioContext *ctx, uint64_t perm, uint64_t shared_perm)
     blk->on_read_error = BLOCKDEV_ON_ERROR_REPORT;
     blk->on_write_error = BLOCKDEV_ON_ERROR_ENOSPC;
 
+    blk->retry_timer = NULL;
+    blk->retry_interval = BLOCK_BACKEND_DEFAULT_RETRY_INTERVAL;
+    blk->retry_start_time = 0;
+    blk->retry_timeout = 0;
+
     block_acct_init(&blk->stats);
 
     qemu_co_queue_init(&blk->queued_requests);
@@ -456,6 +473,10 @@ static void blk_delete(BlockBackend *blk)
     QTAILQ_REMOVE(&block_backends, blk, link);
     drive_info_del(blk->legacy_dinfo);
     block_acct_cleanup(&blk->stats);
+    if (blk->retry_timer) {
+        timer_del(blk->retry_timer);
+        timer_free(blk->retry_timer);
+    }
     g_free(blk);
 }
 
-- 
2.28.0




* [PATCH v4 3/7] block-backend: Add device specific retry callback
  2020-12-15 12:30 [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
  2020-12-15 12:30 ` [PATCH v4 1/7] qapi/block-core: Add retry option for error action Jiahui Cen
  2020-12-15 12:30 ` [PATCH v4 2/7] block-backend: Introduce retry timer Jiahui Cen
@ 2020-12-15 12:30 ` Jiahui Cen
  2020-12-15 12:30 ` [PATCH v4 4/7] block-backend: Enable retry action on errors Jiahui Cen
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Jiahui Cen @ 2020-12-15 12:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Add retry_request_cb to BlockDevOps so that a device can implement its own
retry action. The backend's retry timer is created only when the backend is
configured with 'retry' on errors and the device provides this callback.
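
The registration condition can be sketched on its own as below. The types
(DevOps, Backend) and the function backend_set_dev_ops() are simplified,
hypothetical stand-ins; a boolean replaces the real QEMUTimer so the
example does not depend on QEMU's timer API.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { POLICY_REPORT, POLICY_RETRY } Policy;

/* Hypothetical device ops table; the retry callback is optional. */
typedef struct {
    void (*retry_request_cb)(void *opaque);
} DevOps;

typedef struct {
    Policy on_read_error;
    Policy on_write_error;
    const DevOps *ops;
    void *opaque;
    bool has_retry_timer;   /* stands in for a non-NULL retry timer */
} Backend;

/*
 * Attach device ops. A retry timer is only set up when retry is requested
 * for reads or writes AND the device knows how to retry.
 */
static void backend_set_dev_ops(Backend *b, const DevOps *ops, void *opaque)
{
    b->ops = ops;
    b->opaque = opaque;

    if ((b->on_read_error == POLICY_RETRY ||
         b->on_write_error == POLICY_RETRY) &&
        ops->retry_request_cb) {
        b->has_retry_timer = true;
    }
}

/* Example callback a device model might provide. */
static void example_retry_cb(void *opaque)
{
    (void)opaque;
}
```

A device without retry_request_cb therefore never gets a timer, even if the
user asked for werror=retry or rerror=retry.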

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 block/block-backend.c          | 8 ++++++++
 include/sysemu/block-backend.h | 4 ++++
 2 files changed, 12 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index fe775ea298..bca7c581ee 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -995,6 +995,14 @@ void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops,
     blk->dev_ops = ops;
     blk->dev_opaque = opaque;
 
+    if ((blk->on_read_error == BLOCKDEV_ON_ERROR_RETRY ||
+         blk->on_write_error == BLOCKDEV_ON_ERROR_RETRY) &&
+        ops->retry_request_cb) {
+        blk->retry_timer = aio_timer_new(blk->ctx, QEMU_CLOCK_REALTIME,
+                                         SCALE_MS, ops->retry_request_cb,
+                                         opaque);
+    }
+
     /* Are we currently quiesced? Should we enforce this right now? */
     if (blk->quiesce_counter && ops->drained_begin) {
         ops->drained_begin(opaque);
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 8203d7f6f9..b31144aca9 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -66,6 +66,10 @@ typedef struct BlockDevOps {
      * Runs when the backend's last drain request ends.
      */
     void (*drained_end)(void *opaque);
+    /*
+     * Runs when retrying failed requests.
+     */
+    void (*retry_request_cb)(void *opaque);
 } BlockDevOps;
 
 /* This struct is embedded in (the private) BlockBackend struct and contains
-- 
2.28.0




* [PATCH v4 4/7] block-backend: Enable retry action on errors
  2020-12-15 12:30 [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (2 preceding siblings ...)
  2020-12-15 12:30 ` [PATCH v4 3/7] block-backend: Add device specific retry callback Jiahui Cen
@ 2020-12-15 12:30 ` Jiahui Cen
  2020-12-15 12:30 ` [PATCH v4 5/7] block-backend: Add timeout support for retry Jiahui Cen
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Jiahui Cen @ 2020-12-15 12:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Enable the retry action when the backend's retry timer is available;
otherwise fall back to 'report'. On a retry action, the timer is armed to
fire after retry_interval and run the device-specific retry callback.
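
A minimal sketch of that decision and of arming the timer, assuming
simplified hypothetical types: Backend, get_error_action() and
apply_error_action() are illustrative names, and the timer is modeled as a
plain deadline in milliseconds rather than a QEMUTimer.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { POLICY_REPORT, POLICY_RETRY } Policy;
typedef enum { ACTION_REPORT, ACTION_RETRY } Action;

typedef struct {
    bool has_retry_timer;       /* device registered a retry callback */
    int64_t retry_interval_ms;  /* delay before the next retry */
    int64_t timer_deadline_ms;  /* -1 while the timer is not armed */
} Backend;

/* 'retry' degrades to 'report' when no retry timer exists. */
static Action get_error_action(const Backend *b, Policy policy)
{
    if (policy == POLICY_RETRY && b->has_retry_timer) {
        return ACTION_RETRY;
    }
    return ACTION_REPORT;
}

/* On retry, (re)arm the timer relative to the current time. */
static void apply_error_action(Backend *b, Action action, int64_t now_ms)
{
    if (action == ACTION_RETRY) {
        b->timer_deadline_ms = now_ms + b->retry_interval_ms;
    }
}
```

Each new error re-arms the timer, so the deadline always lies one interval
after the most recent failure.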

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 block/block-backend.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index bca7c581ee..9c6e50e568 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1803,6 +1803,9 @@ BlockErrorAction blk_get_error_action(BlockBackend *blk, bool is_read,
         return BLOCK_ERROR_ACTION_REPORT;
     case BLOCKDEV_ON_ERROR_IGNORE:
         return BLOCK_ERROR_ACTION_IGNORE;
+    case BLOCKDEV_ON_ERROR_RETRY:
+        return (blk->retry_timer) ?
+               BLOCK_ERROR_ACTION_RETRY : BLOCK_ERROR_ACTION_REPORT;
     case BLOCKDEV_ON_ERROR_AUTO:
     default:
         abort();
@@ -1850,6 +1853,10 @@ void blk_error_action(BlockBackend *blk, BlockErrorAction action,
         qemu_system_vmstop_request_prepare();
         send_qmp_error_event(blk, action, is_read, error);
         qemu_system_vmstop_request(RUN_STATE_IO_ERROR);
+    } else if (action == BLOCK_ERROR_ACTION_RETRY) {
+        timer_mod(blk->retry_timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) +
+                                    blk->retry_interval);
+        send_qmp_error_event(blk, action, is_read, error);
     } else {
         send_qmp_error_event(blk, action, is_read, error);
     }
-- 
2.28.0




* [PATCH v4 5/7] block-backend: Add timeout support for retry
  2020-12-15 12:30 [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (3 preceding siblings ...)
  2020-12-15 12:30 ` [PATCH v4 4/7] block-backend: Enable retry action on errors Jiahui Cen
@ 2020-12-15 12:30 ` Jiahui Cen
  2020-12-15 12:30 ` [PATCH v4 6/7] block: Add error retry param setting Jiahui Cen
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Jiahui Cen @ 2020-12-15 12:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

A retry should only be triggered while the timeout has not been reached, so
check the timeout before retrying. The device should also reset
retry_start_time after a successful retry.
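
The timeout bookkeeping can be sketched independently of QEMU as below.
RetryState, retry_timed_out() and retry_reset() are hypothetical names, and
the current time is passed in explicitly instead of being read from
QEMU_CLOCK_REALTIME, which makes the logic easy to exercise.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int64_t retry_start_time;  /* ms; 0 means no error window is open */
    int64_t retry_timeout;     /* ms; 0 means retry forever */
} RetryState;

/*
 * Return true once errors have persisted longer than retry_timeout.
 * The first error only records the start time and never times out.
 */
static bool retry_timed_out(RetryState *s, int64_t now_ms)
{
    if (!s->retry_timeout) {
        return false;               /* no timeout: infinite retries */
    }
    if (!s->retry_start_time) {
        s->retry_start_time = now_ms;
        return false;               /* first error opens the window */
    }
    return now_ms > s->retry_start_time + s->retry_timeout;
}

/* Called after a request completes successfully. */
static void retry_reset(RetryState *s)
{
    s->retry_start_time = 0;
}
```

One property of this scheme worth noting: the window is measured from the
first failure, so a successful request has to reset it, otherwise a later
unrelated error would be judged against a stale start time.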

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 block/block-backend.c          | 25 +++++++++++++++++++-
 include/sysemu/block-backend.h |  1 +
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 9c6e50e568..f5386fabb9 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1776,6 +1776,29 @@ void blk_drain_all(void)
     bdrv_drain_all_end();
 }
 
+static bool blk_error_retry_timeout(BlockBackend *blk)
+{
+    /* No timeout set, infinite retries. */
+    if (!blk->retry_timeout) {
+        return false;
+    }
+
+    /* The first time an error occurs. */
+    if (!blk->retry_start_time) {
+        blk->retry_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+        return false;
+    }
+
+    return qemu_clock_get_ms(QEMU_CLOCK_REALTIME) > (blk->retry_start_time +
+                                                     blk->retry_timeout);
+}
+
+void blk_error_retry_reset_timeout(BlockBackend *blk)
+{
+    if (blk->retry_timer && blk->retry_start_time)
+        blk->retry_start_time = 0;
+}
+
 void blk_set_on_error(BlockBackend *blk, BlockdevOnError on_read_error,
                       BlockdevOnError on_write_error)
 {
@@ -1804,7 +1827,7 @@ BlockErrorAction blk_get_error_action(BlockBackend *blk, bool is_read,
     case BLOCKDEV_ON_ERROR_IGNORE:
         return BLOCK_ERROR_ACTION_IGNORE;
     case BLOCKDEV_ON_ERROR_RETRY:
-        return (blk->retry_timer) ?
+        return (blk->retry_timer && !blk_error_retry_timeout(blk)) ?
                BLOCK_ERROR_ACTION_RETRY : BLOCK_ERROR_ACTION_REPORT;
     case BLOCKDEV_ON_ERROR_AUTO:
     default:
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index b31144aca9..070eb7786c 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -188,6 +188,7 @@ void blk_inc_in_flight(BlockBackend *blk);
 void blk_dec_in_flight(BlockBackend *blk);
 void blk_drain(BlockBackend *blk);
 void blk_drain_all(void);
+void blk_error_retry_reset_timeout(BlockBackend *blk);
 void blk_set_on_error(BlockBackend *blk, BlockdevOnError on_read_error,
                       BlockdevOnError on_write_error);
 BlockdevOnError blk_get_on_error(BlockBackend *blk, bool is_read);
-- 
2.28.0




* [PATCH v4 6/7] block: Add error retry param setting
  2020-12-15 12:30 [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (4 preceding siblings ...)
  2020-12-15 12:30 ` [PATCH v4 5/7] block-backend: Add timeout support for retry Jiahui Cen
@ 2020-12-15 12:30 ` Jiahui Cen
  2020-12-15 12:30 ` [PATCH v4 7/7] virtio_blk: Add support for retry on errors Jiahui Cen
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Jiahui Cen @ 2020-12-15 12:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Add "retry_interval" and "retry_timeout" parameters to the drive and device
options. Both are given in milliseconds and are valid only when werror=retry
or rerror=retry is set.

e.g. --drive file=image,rerror=retry,retry_interval=1000,retry_timeout=5000
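
The validity rule above can be sketched as a small predicate. The function
name retry_opts_valid() is hypothetical; it mirrors the check that a retry
parameter is rejected unless at least one of werror/rerror is "retry".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * werror/rerror may be NULL when the option was not given.
 * has_retry_opt is true when retry_interval or retry_timeout was given.
 */
static bool retry_opts_valid(const char *werror, const char *rerror,
                             bool has_retry_opt)
{
    bool write_retry = werror && !strcmp(werror, "retry");
    bool read_retry = rerror && !strcmp(rerror, "retry");

    /* Retry parameters need at least one direction set to "retry". */
    return !has_retry_opt || write_retry || read_retry;
}
```

For the example command line above, rerror=retry alone is enough to make
both retry parameters acceptable.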

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 block/block-backend.c          | 13 +++--
 blockdev.c                     | 50 ++++++++++++++++++++
 hw/block/block.c               | 10 ++++
 include/hw/block/block.h       |  7 ++-
 include/sysemu/block-backend.h |  5 ++
 5 files changed, 81 insertions(+), 4 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index f5386fabb9..230b1c65b5 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -35,9 +35,6 @@
 
 static AioContext *blk_aiocb_get_aio_context(BlockAIOCB *acb);
 
-/* block backend default retry interval */
-#define BLOCK_BACKEND_DEFAULT_RETRY_INTERVAL   1000
-
 typedef struct BlockBackendAioNotifier {
     void (*attached_aio_context)(AioContext *new_context, void *opaque);
     void (*detach_aio_context)(void *opaque);
@@ -1776,6 +1773,16 @@ void blk_drain_all(void)
     bdrv_drain_all_end();
 }
 
+void blk_set_on_error_retry_interval(BlockBackend *blk, int64_t interval)
+{
+    blk->retry_interval = interval;
+}
+
+void blk_set_on_error_retry_timeout(BlockBackend *blk, int64_t timeout)
+{
+    blk->retry_timeout = timeout;
+}
+
 static bool blk_error_retry_timeout(BlockBackend *blk)
 {
     /* No timeout set, infinite retries. */
diff --git a/blockdev.c b/blockdev.c
index 47c0e6db52..39c0669981 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -489,6 +489,7 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
     const char *buf;
     int bdrv_flags = 0;
     int on_read_error, on_write_error;
+    int64_t retry_interval, retry_timeout;
     bool account_invalid, account_failed;
     bool writethrough, read_only;
     BlockBackend *blk;
@@ -581,6 +582,10 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
         }
     }
 
+    retry_interval = qemu_opt_get_number(opts, "retry_interval",
+                                         BLOCK_BACKEND_DEFAULT_RETRY_INTERVAL);
+    retry_timeout = qemu_opt_get_number(opts, "retry_timeout", 0);
+
     if (snapshot) {
         bdrv_flags |= BDRV_O_SNAPSHOT;
     }
@@ -645,6 +650,11 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
 
     blk_set_enable_write_cache(blk, !writethrough);
     blk_set_on_error(blk, on_read_error, on_write_error);
+    if (on_read_error == BLOCKDEV_ON_ERROR_RETRY ||
+        on_write_error == BLOCKDEV_ON_ERROR_RETRY) {
+        blk_set_on_error_retry_interval(blk, retry_interval);
+        blk_set_on_error_retry_timeout(blk, retry_timeout);
+    }
 
     if (!monitor_add_blk(blk, id, errp)) {
         blk_unref(blk);
@@ -771,6 +781,14 @@ QemuOptsList qemu_legacy_drive_opts = {
             .name = "werror",
             .type = QEMU_OPT_STRING,
             .help = "write error action",
+        },{
+            .name = "retry_interval",
+            .type = QEMU_OPT_NUMBER,
+            .help = "interval for retry action in millisecond",
+        },{
+            .name = "retry_timeout",
+            .type = QEMU_OPT_NUMBER,
+            .help = "timeout for retry action in millisecond",
         },{
             .name = "copy-on-read",
             .type = QEMU_OPT_BOOL,
@@ -793,6 +811,7 @@ DriveInfo *drive_new(QemuOpts *all_opts, BlockInterfaceType block_default_type,
     BlockInterfaceType type;
     int max_devs, bus_id, unit_id, index;
     const char *werror, *rerror;
+    int64_t retry_interval, retry_timeout;
     bool read_only = false;
     bool copy_on_read;
     const char *filename;
@@ -1004,6 +1023,29 @@ DriveInfo *drive_new(QemuOpts *all_opts, BlockInterfaceType block_default_type,
         qdict_put_str(bs_opts, "rerror", rerror);
     }
 
+    if (qemu_opt_find(legacy_opts, "retry_interval")) {
+        if ((werror == NULL || strcmp(werror, "retry")) &&
+            (rerror == NULL || strcmp(rerror, "retry"))) {
+            error_setg(errp, "retry_interval is only supported "
+                             "by werror/rerror=retry");
+            goto fail;
+        }
+        retry_interval = qemu_opt_get_number(legacy_opts, "retry_interval",
+                             BLOCK_BACKEND_DEFAULT_RETRY_INTERVAL);
+        qdict_put_int(bs_opts, "retry_interval", retry_interval);
+    }
+
+    if (qemu_opt_find(legacy_opts, "retry_timeout")) {
+        if ((werror == NULL || strcmp(werror, "retry")) &&
+            (rerror == NULL || strcmp(rerror, "retry"))) {
+            error_setg(errp, "retry_timeout is only supported "
+                             "by werror/rerror=retry");
+            goto fail;
+        }
+        retry_timeout = qemu_opt_get_number(legacy_opts, "retry_timeout", 0);
+        qdict_put_int(bs_opts, "retry_timeout", retry_timeout);
+    }
+
     /* Actual block device init: Functionality shared with blockdev-add */
     blk = blockdev_init(filename, bs_opts, errp);
     bs_opts = NULL;
@@ -3773,6 +3815,14 @@ QemuOptsList qemu_common_drive_opts = {
             .name = "werror",
             .type = QEMU_OPT_STRING,
             .help = "write error action",
+        },{
+            .name = "retry_interval",
+            .type = QEMU_OPT_NUMBER,
+            .help = "interval for retry action in millisecond",
+        },{
+            .name = "retry_timeout",
+            .type = QEMU_OPT_NUMBER,
+            .help = "timeout for retry action in millisecond",
         },{
             .name = BDRV_OPT_READ_ONLY,
             .type = QEMU_OPT_BOOL,
diff --git a/hw/block/block.c b/hw/block/block.c
index 1e34573da7..d2f35dc465 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -172,6 +172,16 @@ bool blkconf_apply_backend_options(BlockConf *conf, bool readonly,
     blk_set_enable_write_cache(blk, wce);
     blk_set_on_error(blk, rerror, werror);
 
+    if (rerror == BLOCKDEV_ON_ERROR_RETRY ||
+        werror == BLOCKDEV_ON_ERROR_RETRY) {
+        if (conf->retry_interval >= 0) {
+            blk_set_on_error_retry_interval(blk, conf->retry_interval);
+        }
+        if (conf->retry_timeout >= 0) {
+            blk_set_on_error_retry_timeout(blk, conf->retry_timeout);
+        }
+    }
+
     return true;
 }
 
diff --git a/include/hw/block/block.h b/include/hw/block/block.h
index 1e8b6253dd..a9f04db147 100644
--- a/include/hw/block/block.h
+++ b/include/hw/block/block.h
@@ -31,6 +31,8 @@ typedef struct BlockConf {
     bool share_rw;
     BlockdevOnError rerror;
     BlockdevOnError werror;
+    int64_t retry_interval;
+    int64_t retry_timeout;
 } BlockConf;
 
 static inline unsigned int get_physical_block_exp(BlockConf *conf)
@@ -75,7 +77,10 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf)
     DEFINE_PROP_BLOCKDEV_ON_ERROR("rerror", _state, _conf.rerror,       \
                                   BLOCKDEV_ON_ERROR_AUTO),              \
     DEFINE_PROP_BLOCKDEV_ON_ERROR("werror", _state, _conf.werror,       \
-                                  BLOCKDEV_ON_ERROR_AUTO)
+                                  BLOCKDEV_ON_ERROR_AUTO),              \
+    DEFINE_PROP_INT64("retry_interval", _state, _conf.retry_interval,   \
+                      -1),                                              \
+    DEFINE_PROP_INT64("retry_timeout", _state, _conf.retry_timeout, -1)
 
 /* Backend access helpers */
 
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 070eb7786c..a82b6da1da 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -25,6 +25,9 @@
  */
 #include "block/block.h"
 
+/* block backend default retry interval */
+#define BLOCK_BACKEND_DEFAULT_RETRY_INTERVAL   1000
+
 /* Callbacks for block device models */
 typedef struct BlockDevOps {
     /*
@@ -188,6 +191,8 @@ void blk_inc_in_flight(BlockBackend *blk);
 void blk_dec_in_flight(BlockBackend *blk);
 void blk_drain(BlockBackend *blk);
 void blk_drain_all(void);
+void blk_set_on_error_retry_interval(BlockBackend *blk, int64_t interval);
+void blk_set_on_error_retry_timeout(BlockBackend *blk, int64_t timeout);
 void blk_error_retry_reset_timeout(BlockBackend *blk);
 void blk_set_on_error(BlockBackend *blk, BlockdevOnError on_read_error,
                       BlockdevOnError on_write_error);
-- 
2.28.0




* [PATCH v4 7/7] virtio_blk: Add support for retry on errors
  2020-12-15 12:30 [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (5 preceding siblings ...)
  2020-12-15 12:30 ` [PATCH v4 6/7] block: Add error retry param setting Jiahui Cen
@ 2020-12-15 12:30 ` Jiahui Cen
  2020-12-21  7:57 ` [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Jiahui Cen @ 2020-12-15 12:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Insert failed requests into the device's request list for later retry, and
implement retry_request_cb by processing the queued requests.
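
The queue-and-retry pattern can be sketched with a plain singly linked
list. Req, Dev, dev_queue_failed() and dev_retry_queued() are hypothetical
simplifications; a backend_up flag stands in for whether a resubmitted
request succeeds.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Req {
    int id;
    struct Req *next;
} Req;

typedef struct {
    Req *rq;          /* failed requests waiting for the next retry */
    bool backend_up;  /* stand-in for "the resubmitted I/O succeeds" */
    int completed;    /* number of requests that finished */
} Dev;

/* Push a failed request onto the head of the retry list. */
static void dev_queue_failed(Dev *d, Req *r)
{
    r->next = d->rq;
    d->rq = r;
}

/* Timer callback body: resubmit everything queued so far. */
static void dev_retry_queued(Dev *d)
{
    Req *req = d->rq;

    /* Detach the list first so requests that fail again are queued for
     * the next timer tick instead of being reprocessed in this pass. */
    d->rq = NULL;
    while (req) {
        Req *next = req->next;
        if (d->backend_up) {
            d->completed++;
        } else {
            dev_queue_failed(d, req);
        }
        req = next;
    }
}
```

While the backend stays down, each pass simply moves the requests back onto
the list; once it recovers, one pass drains them all.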

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 hw/block/virtio-blk.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index bac2d6fa2b..cf8b350eaf 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -108,6 +108,10 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
             block_acct_failed(blk_get_stats(s->blk), &req->acct);
         }
         virtio_blk_free_request(req);
+    } else if (action == BLOCK_ERROR_ACTION_RETRY) {
+        req->mr_next = NULL;
+        req->next = s->rq;
+        s->rq = req;
     }
 
     blk_error_action(s->blk, action, is_read, error);
@@ -149,6 +153,7 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
             }
         }
 
+        blk_error_retry_reset_timeout(s->blk);
         virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
         block_acct_done(blk_get_stats(s->blk), &req->acct);
         virtio_blk_free_request(req);
@@ -828,12 +833,12 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 
 void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh)
 {
-    VirtIOBlockReq *req = s->rq;
+    VirtIOBlockReq *req;
     MultiReqBuffer mrb = {};
 
-    s->rq = NULL;
-
     aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
+    req = s->rq;
+    s->rq = NULL;
     while (req) {
         VirtIOBlockReq *next = req->next;
         if (virtio_blk_handle_request(req, &mrb)) {
@@ -1134,8 +1139,16 @@ static void virtio_blk_resize(void *opaque)
     aio_bh_schedule_oneshot(qemu_get_aio_context(), virtio_resize_cb, vdev);
 }
 
+static void virtio_blk_retry_request(void *opaque)
+{
+    VirtIOBlock *s = VIRTIO_BLK(opaque);
+
+    virtio_blk_process_queued_requests(s, false);
+}
+
 static const BlockDevOps virtio_block_ops = {
     .resize_cb = virtio_blk_resize,
+    .retry_request_cb = virtio_blk_retry_request,
 };
 
 static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
-- 
2.28.0




* Re: [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism
  2020-12-15 12:30 [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (6 preceding siblings ...)
  2020-12-15 12:30 ` [PATCH v4 7/7] virtio_blk: Add support for retry on errors Jiahui Cen
@ 2020-12-21  7:57 ` Jiahui Cen
  2021-01-05  9:33 ` Ping: " Jiahui Cen
  2021-01-25  3:23 ` Ying Fang
  9 siblings, 0 replies; 12+ messages in thread
From: Jiahui Cen @ 2020-12-21  7:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, zhang.zhanghailiang, qemu-block, Michael S. Tsirkin,
	Markus Armbruster, Max Reitz, Stefan Hajnoczi, fangying1,
	John Snow

Kindly ping...




* Ping: [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism
  2020-12-15 12:30 [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (7 preceding siblings ...)
  2020-12-21  7:57 ` [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
@ 2021-01-05  9:33 ` Jiahui Cen
  2021-01-25  3:23 ` Ying Fang
  9 siblings, 0 replies; 12+ messages in thread
From: Jiahui Cen @ 2021-01-05  9:33 UTC (permalink / raw)
  To: qemu-devel, Kevin Wolf
  Cc: zhang.zhanghailiang, qemu-block, Michael S. Tsirkin,
	Markus Armbruster, Max Reitz, Stefan Hajnoczi, fangying1,
	John Snow

Hi Kevin,

What do you think of these patches?

Thanks,
Jiahui

On 2020/12/15 20:30, Jiahui Cen wrote:
> A VM in a cloud environment may use a virtual disk as its backend storage,
> and there are usually filesystems on the virtual block device. When the
> backend storage is temporarily down, any I/O issued to the virtual block
> device will cause an error. For example, an error in an ext4 filesystem
> can make the filesystem read-only. In a production environment, cloud
> backend storage can often be recovered soon. For example, an IP-SAN may go
> down due to a network failure and come back online soon after the network
> recovers. However, the error in the filesystem may not be recoverable
> without a device reattach or a system restart. Thus an I/O retry mechanism
> is needed to implement a self-healing system.
> 
> This patch series proposes extending the werror=/rerror= mechanism with a
> 'retry' feature. It can automatically retry failed I/O requests on error
> without sending the error back to the guest, so the guest can resume
> running smoothly once I/O has recovered.
> 
> v3->v4:
> * Adapt to werror=/rerror= mechanism.
> 
> v2->v3:
> * Add a doc to describe I/O hang.
> 
> v1->v2:
> * Rebase to fix compile problems.
> * Fix incorrect removal of rehandle list.
> * Provide rehandle pause interface.
> 
> REF: https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg06560.html
> 
> Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
> Signed-off-by: Ying Fang <fangying1@huawei.com>
> 
> Jiahui Cen (7):
>   qapi/block-core: Add retry option for error action
>   block-backend: Introduce retry timer
>   block-backend: Add device specific retry callback
>   block-backend: Enable retry action on errors
>   block-backend: Add timeout support for retry
>   block: Add error retry param setting
>   virtio_blk: Add support for retry on errors
> 
>  block/block-backend.c          | 66 ++++++++++++++++++++
>  blockdev.c                     | 52 +++++++++++++++
>  hw/block/block.c               | 10 +++
>  hw/block/virtio-blk.c          | 19 +++++-
>  include/hw/block/block.h       |  7 ++-
>  include/sysemu/block-backend.h | 10 +++
>  qapi/block-core.json           |  4 +-
>  7 files changed, 162 insertions(+), 6 deletions(-)
> 



* Re: [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism
  2020-12-15 12:30 [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (8 preceding siblings ...)
  2021-01-05  9:33 ` Ping: " Jiahui Cen
@ 2021-01-25  3:23 ` Ying Fang
  9 siblings, 0 replies; 12+ messages in thread
From: Ying Fang @ 2021-01-25  3:23 UTC (permalink / raw)
  To: Jiahui Cen, qemu-devel
  Cc: Kevin Wolf, zhang.zhanghailiang, qemu-block, Michael S. Tsirkin,
	Markus Armbruster, Max Reitz, Stefan Hajnoczi, John Snow

A gentle ping on this series.

Thanks to Stefan's suggestion, we have re-implemented the concept by
introducing the 'retry' feature based on the werror=/rerror= mechanism.
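
As a rough illustration of the idea only (not the code in the patches), the
retry-with-timeout decision could be sketched as below. Every name here
(RetryPolicy, retry_policy_on_error, and so on) is hypothetical, and treating
a timeout of 0 as "retry forever" is an assumption made for the sketch:

```c
#include <stdint.h>

/* Hypothetical names throughout; none of these come from the series. */
typedef enum { ERR_ACTION_REPORT, ERR_ACTION_RETRY } ErrAction;

typedef struct {
    int64_t interval_ms;    /* delay before the next retry attempt */
    int64_t timeout_ms;     /* give up after this long; 0 = never (assumed) */
    int64_t first_error_ms; /* time of first failure in this outage, -1 if none */
} RetryPolicy;

void retry_policy_init(RetryPolicy *p, int64_t interval_ms, int64_t timeout_ms)
{
    p->interval_ms = interval_ms;
    p->timeout_ms = timeout_ms;
    p->first_error_ms = -1;
}

/*
 * Called when a request fails at time now_ms. Returns ERR_ACTION_RETRY and
 * sets *next_retry_ms while the outage is younger than the timeout;
 * otherwise returns ERR_ACTION_REPORT so the error reaches the guest.
 */
ErrAction retry_policy_on_error(RetryPolicy *p, int64_t now_ms,
                                int64_t *next_retry_ms)
{
    if (p->first_error_ms < 0) {
        p->first_error_ms = now_ms;   /* first failure of a new outage */
    }
    if (p->timeout_ms > 0 && now_ms - p->first_error_ms >= p->timeout_ms) {
        p->first_error_ms = -1;       /* outage over: stop hiding the error */
        return ERR_ACTION_REPORT;
    }
    *next_retry_ms = now_ms + p->interval_ms;
    return ERR_ACTION_RETRY;
}

/* Called when a request completes successfully: the outage has ended. */
void retry_policy_on_success(RetryPolicy *p)
{
    p->first_error_ms = -1;
}
```

The timeout is measured from the first failure of an outage rather than from
each individual failure, so a backend that never recovers cannot keep the
guest waiting indefinitely unless the timeout is disabled.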

We hope this thread won't be missed. Any comments and reviews are welcome.

Thanks.
Ying Fang.

On 12/15/2020 8:30 PM, Jiahui Cen wrote:
> A VM in a cloud environment may use a virtual disk as its backend storage,
> and there are usually filesystems on the virtual block device. When the
> backend storage is temporarily down, any I/O issued to the virtual block
> device will cause an error. For example, an error in an ext4 filesystem
> can make the filesystem read-only. In a production environment, cloud
> backend storage can often be recovered soon. For example, an IP-SAN may go
> down due to a network failure and come back online soon after the network
> recovers. However, the error in the filesystem may not be recoverable
> without a device reattach or a system restart. Thus an I/O retry mechanism
> is needed to implement a self-healing system.
> 
> This patch series proposes extending the werror=/rerror= mechanism with a
> 'retry' feature. It can automatically retry failed I/O requests on error
> without sending the error back to the guest, so the guest can resume
> running smoothly once I/O has recovered.
> 
> v3->v4:
> * Adapt to werror=/rerror= mechanism.
> 
> v2->v3:
> * Add a doc to describe I/O hang.
> 
> v1->v2:
> * Rebase to fix compile problems.
> * Fix incorrect removal of rehandle list.
> * Provide rehandle pause interface.
> 
> REF: https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg06560.html
> 
> Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
> Signed-off-by: Ying Fang <fangying1@huawei.com>
> 
> Jiahui Cen (7):
>    qapi/block-core: Add retry option for error action
>    block-backend: Introduce retry timer
>    block-backend: Add device specific retry callback
>    block-backend: Enable retry action on errors
>    block-backend: Add timeout support for retry
>    block: Add error retry param setting
>    virtio_blk: Add support for retry on errors
> 
>   block/block-backend.c          | 66 ++++++++++++++++++++
>   blockdev.c                     | 52 +++++++++++++++
>   hw/block/block.c               | 10 +++
>   hw/block/virtio-blk.c          | 19 +++++-
>   include/hw/block/block.h       |  7 ++-
>   include/sysemu/block-backend.h | 10 +++
>   qapi/block-core.json           |  4 +-
>   7 files changed, 162 insertions(+), 6 deletions(-)
> 



* Re: [PATCH v4 1/7] qapi/block-core: Add retry option for error action
  2020-12-15 12:30 ` [PATCH v4 1/7] qapi/block-core: Add retry option for error action Jiahui Cen
@ 2021-01-27 17:16   ` Eric Blake
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Blake @ 2021-01-27 17:16 UTC (permalink / raw)
  To: Jiahui Cen, qemu-devel
  Cc: Kevin Wolf, zhang.zhanghailiang, qemu-block, Michael S. Tsirkin,
	Markus Armbruster, Max Reitz, Stefan Hajnoczi, fangying1,
	John Snow

On 12/15/20 6:30 AM, Jiahui Cen wrote:
> Add a new error action 'retry' to support retry on errors.
> 
> Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
> Signed-off-by: Ying Fang <fangying1@huawei.com>
> ---
>  blockdev.c           | 2 ++
>  qapi/block-core.json | 4 ++--
>  2 files changed, 4 insertions(+), 2 deletions(-)

> +++ b/qapi/block-core.json
> @@ -1146,7 +1146,7 @@
>  # Since: 1.3
>  ##
>  { 'enum': 'BlockdevOnError',
> -  'data': ['report', 'ignore', 'enospc', 'stop', 'auto'] }
> +  'data': ['report', 'ignore', 'enospc', 'stop', 'auto', 'retry'] }

Missing a documentation line that 'retry' was added in 6.0.

>  
>  ##
>  # @MirrorSyncMode:
> @@ -4770,7 +4770,7 @@
>  # Since: 2.1
>  ##
>  { 'enum': 'BlockErrorAction',
> -  'data': [ 'ignore', 'report', 'stop' ] }
> +  'data': [ 'ignore', 'report', 'stop', 'retry' ] }

Likewise.
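
For illustration, the missing lines might look roughly like the following in
the doc comments of the two enums. The parenthesized since-annotation is the
usual QAPI doc-comment convention; the description wording is only a guess at
the intended semantics and is not taken from the patch:

```
# In the doc comment of @BlockdevOnError (wording is an assumption):
#
# @retry: for guest operations, retry the failed request instead of
#         reporting the error to the guest (since 6.0)

# In the doc comment of @BlockErrorAction (wording is an assumption):
#
# @retry: error has been retried (since 6.0)
```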

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Thread overview: 12+ messages
2020-12-15 12:30 [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
2020-12-15 12:30 ` [PATCH v4 1/7] qapi/block-core: Add retry option for error action Jiahui Cen
2021-01-27 17:16   ` Eric Blake
2020-12-15 12:30 ` [PATCH v4 2/7] block-backend: Introduce retry timer Jiahui Cen
2020-12-15 12:30 ` [PATCH v4 3/7] block-backend: Add device specific retry callback Jiahui Cen
2020-12-15 12:30 ` [PATCH v4 4/7] block-backend: Enable retry action on errors Jiahui Cen
2020-12-15 12:30 ` [PATCH v4 5/7] block-backend: Add timeout support for retry Jiahui Cen
2020-12-15 12:30 ` [PATCH v4 6/7] block: Add error retry param setting Jiahui Cen
2020-12-15 12:30 ` [PATCH v4 7/7] virtio_blk: Add support for retry on errors Jiahui Cen
2020-12-21  7:57 ` [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism Jiahui Cen
2021-01-05  9:33 ` Ping: " Jiahui Cen
2021-01-25  3:23 ` Ying Fang
