All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism
@ 2021-02-05 10:13 Jiahui Cen
  2021-02-05 10:13 ` [PATCH v5 1/9] qapi/block-core: Add retry option for error action Jiahui Cen
                   ` (11 more replies)
  0 siblings, 12 replies; 21+ messages in thread
From: Jiahui Cen @ 2021-02-05 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

A VM in the cloud environment may use a virutal disk as the backend storage,
and there are usually filesystems on the virtual block device. When backend
storage is temporarily down, any I/O issued to the virtual block device
will cause an error. For example, an error occurred in ext4 filesystem would
make the filesystem readonly. In production environment, a cloud backend
storage can be soon recovered. For example, an IP-SAN may be down due to
network failure and will be online soon after network is recovered. However,
the error in the filesystem may not be recovered unless a device reattach
or system restart. Thus an I/O retry mechanism is in need to implement a
self-healing system.

This patch series propose to extend the werror=/rerror= mechanism to add
a 'retry' feature. It can automatically retry failed I/O requests on error
without sending error back to guest, and guest can get back running smoothly
when I/O is recovred.

v4->v5:
* Add document for 'retry' in qapi.
* Support werror=/rerror=retry for scsi-disk.
* Pause retry when draining.

v3->v4:
* Adapt to werror=/rerror= mechanism.

v2->v3:
* Add a doc to describe I/O hang.

v1->v2:
* Rebase to fix compile problems.
* Fix incorrect remove of rehandle list.
* Provide rehandle pause interface.

REF: https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg06560.html

Jiahui Cen (9):
  qapi/block-core: Add retry option for error action
  block-backend: Introduce retry timer
  block-backend: Add device specific retry callback
  block-backend: Enable retry action on errors
  block-backend: Add timeout support for retry
  block: Add error retry param setting
  virtio_blk: Add support for retry on errors
  scsi-bus: Refactor the code that retries requests
  scsi-disk: Add support for retry on errors

 block/block-backend.c          | 68 ++++++++++++++++++++
 blockdev.c                     | 52 +++++++++++++++
 hw/block/block.c               | 10 +++
 hw/block/virtio-blk.c          | 21 +++++-
 hw/scsi/scsi-bus.c             | 16 +++--
 hw/scsi/scsi-disk.c            | 16 +++++
 include/hw/block/block.h       |  7 +-
 include/hw/scsi/scsi.h         |  1 +
 include/sysemu/block-backend.h | 10 +++
 qapi/block-core.json           |  9 ++-
 10 files changed, 199 insertions(+), 11 deletions(-)

-- 
2.29.2



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v5 1/9] qapi/block-core: Add retry option for error action
  2021-02-05 10:13 [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
@ 2021-02-05 10:13 ` Jiahui Cen
  2021-02-22 17:22   ` Stefan Hajnoczi
  2021-02-05 10:13 ` [PATCH v5 2/9] block-backend: Introduce retry timer Jiahui Cen
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Jiahui Cen @ 2021-02-05 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Add a new error action 'retry' to support retry on errors.

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 blockdev.c           | 2 ++
 qapi/block-core.json | 9 +++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index b250b9b959..ece1d8ae58 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -342,6 +342,8 @@ static int parse_block_error_action(const char *buf, bool is_read, Error **errp)
         return BLOCKDEV_ON_ERROR_STOP;
     } else if (!strcmp(buf, "report")) {
         return BLOCKDEV_ON_ERROR_REPORT;
+    } else if (!strcmp(buf, "retry")) {
+        return BLOCKDEV_ON_ERROR_RETRY;
     } else {
         error_setg(errp, "'%s' invalid %s error action",
                    buf, is_read ? "read" : "write");
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 9f555d5c1d..30ea43cb77 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1143,10 +1143,13 @@
 #
 # @auto: inherit the error handling policy of the backend (since: 2.7)
 #
+# @retry: for guest operations, retry the failing request; (since: 6.0)
+#         for jobs, not supported
+#
 # Since: 1.3
 ##
 { 'enum': 'BlockdevOnError',
-  'data': ['report', 'ignore', 'enospc', 'stop', 'auto'] }
+  'data': ['report', 'ignore', 'enospc', 'stop', 'auto', 'retry'] }
 
 ##
 # @MirrorSyncMode:
@@ -4839,10 +4842,12 @@
 #
 # @stop: error caused VM to be stopped
 #
+# @retry: error has been retried (since: 6.0)
+#
 # Since: 2.1
 ##
 { 'enum': 'BlockErrorAction',
-  'data': [ 'ignore', 'report', 'stop' ] }
+  'data': [ 'ignore', 'report', 'stop', 'retry' ] }
 
 
 ##
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v5 2/9] block-backend: Introduce retry timer
  2021-02-05 10:13 [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
  2021-02-05 10:13 ` [PATCH v5 1/9] qapi/block-core: Add retry option for error action Jiahui Cen
@ 2021-02-05 10:13 ` Jiahui Cen
  2021-02-05 10:13 ` [PATCH v5 3/9] block-backend: Add device specific retry callback Jiahui Cen
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Jiahui Cen @ 2021-02-05 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Add a timer to regularly trigger retry on errors.

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 block/block-backend.c | 21 ++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index e493f17515..3a9d55cbe3 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -35,6 +35,9 @@
 
 static AioContext *blk_aiocb_get_aio_context(BlockAIOCB *acb);
 
+/* block backend default retry interval */
+#define BLOCK_BACKEND_DEFAULT_RETRY_INTERVAL   1000
+
 typedef struct BlockBackendAioNotifier {
     void (*attached_aio_context)(AioContext *new_context, void *opaque);
     void (*detach_aio_context)(void *opaque);
@@ -95,6 +98,15 @@ struct BlockBackend {
      * Accessed with atomic ops.
      */
     unsigned int in_flight;
+
+    /* Timer for retry on errors. */
+    QEMUTimer *retry_timer;
+    /* Interval in ms to trigger next retry. */
+    int64_t retry_interval;
+    /* Start time of the first error. Used to check timeout. */
+    int64_t retry_start_time;
+    /* Retry timeout. 0 represents infinite retry. */
+    int64_t retry_timeout;
 };
 
 typedef struct BlockBackendAIOCB {
@@ -345,6 +357,11 @@ BlockBackend *blk_new(AioContext *ctx, uint64_t perm, uint64_t shared_perm)
     blk->on_read_error = BLOCKDEV_ON_ERROR_REPORT;
     blk->on_write_error = BLOCKDEV_ON_ERROR_ENOSPC;
 
+    blk->retry_timer = NULL;
+    blk->retry_interval = BLOCK_BACKEND_DEFAULT_RETRY_INTERVAL;
+    blk->retry_start_time = 0;
+    blk->retry_timeout = 0;
+
     block_acct_init(&blk->stats);
 
     qemu_co_queue_init(&blk->queued_requests);
@@ -456,6 +473,10 @@ static void blk_delete(BlockBackend *blk)
     QTAILQ_REMOVE(&block_backends, blk, link);
     drive_info_del(blk->legacy_dinfo);
     block_acct_cleanup(&blk->stats);
+    if (blk->retry_timer) {
+        timer_del(blk->retry_timer);
+        timer_free(blk->retry_timer);
+    }
     g_free(blk);
 }
 
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v5 3/9] block-backend: Add device specific retry callback
  2021-02-05 10:13 [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
  2021-02-05 10:13 ` [PATCH v5 1/9] qapi/block-core: Add retry option for error action Jiahui Cen
  2021-02-05 10:13 ` [PATCH v5 2/9] block-backend: Introduce retry timer Jiahui Cen
@ 2021-02-05 10:13 ` Jiahui Cen
  2021-02-22 17:24   ` Stefan Hajnoczi
  2021-02-05 10:13 ` [PATCH v5 4/9] block-backend: Enable retry action on errors Jiahui Cen
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Jiahui Cen @ 2021-02-05 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Add retry_request_cb in BlockDevOps to do device specific retry action.
Backend's timer would be registered only when the backend is set 'retry'
on errors and the device supports retry action.

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 block/block-backend.c          | 8 ++++++++
 include/sysemu/block-backend.h | 4 ++++
 2 files changed, 12 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index 3a9d55cbe3..a8bfaf6e4a 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -995,6 +995,14 @@ void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops,
     blk->dev_ops = ops;
     blk->dev_opaque = opaque;
 
+    if ((blk->on_read_error == BLOCKDEV_ON_ERROR_RETRY ||
+         blk->on_write_error == BLOCKDEV_ON_ERROR_RETRY) &&
+        ops->retry_request_cb) {
+        blk->retry_timer = aio_timer_new(blk->ctx, QEMU_CLOCK_REALTIME,
+                                         SCALE_MS, ops->retry_request_cb,
+                                         opaque);
+    }
+
     /* Are we currently quiesced? Should we enforce this right now? */
     if (blk->quiesce_counter && ops->drained_begin) {
         ops->drained_begin(opaque);
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 880e903293..d806852db5 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -66,6 +66,10 @@ typedef struct BlockDevOps {
      * Runs when the backend's last drain request ends.
      */
     void (*drained_end)(void *opaque);
+    /*
+     * Runs when retrying failed requests.
+     */
+    void (*retry_request_cb)(void *opaque);
 } BlockDevOps;
 
 /* This struct is embedded in (the private) BlockBackend struct and contains
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v5 4/9] block-backend: Enable retry action on errors
  2021-02-05 10:13 [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (2 preceding siblings ...)
  2021-02-05 10:13 ` [PATCH v5 3/9] block-backend: Add device specific retry callback Jiahui Cen
@ 2021-02-05 10:13 ` Jiahui Cen
  2021-02-22 17:12   ` Stefan Hajnoczi
  2021-02-05 10:13 ` [PATCH v5 5/9] block-backend: Add timeout support for retry Jiahui Cen
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Jiahui Cen @ 2021-02-05 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Enable retry action when backend's retry timer is available. It would
trigger the timer to do device specific retry action.

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 block/block-backend.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index a8bfaf6e4a..ab75cb1971 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1803,6 +1803,9 @@ BlockErrorAction blk_get_error_action(BlockBackend *blk, bool is_read,
         return BLOCK_ERROR_ACTION_REPORT;
     case BLOCKDEV_ON_ERROR_IGNORE:
         return BLOCK_ERROR_ACTION_IGNORE;
+    case BLOCKDEV_ON_ERROR_RETRY:
+        return (blk->retry_timer) ?
+               BLOCK_ERROR_ACTION_RETRY : BLOCK_ERROR_ACTION_REPORT;
     case BLOCKDEV_ON_ERROR_AUTO:
     default:
         abort();
@@ -1850,6 +1853,12 @@ void blk_error_action(BlockBackend *blk, BlockErrorAction action,
         qemu_system_vmstop_request_prepare();
         send_qmp_error_event(blk, action, is_read, error);
         qemu_system_vmstop_request(RUN_STATE_IO_ERROR);
+    } else if (action == BLOCK_ERROR_ACTION_RETRY) {
+        if (!blk->quiesce_counter) {
+            timer_mod(blk->retry_timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) +
+                                        blk->retry_interval);
+            send_qmp_error_event(blk, action, is_read, error);
+        }
     } else {
         send_qmp_error_event(blk, action, is_read, error);
     }
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v5 5/9] block-backend: Add timeout support for retry
  2021-02-05 10:13 [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (3 preceding siblings ...)
  2021-02-05 10:13 ` [PATCH v5 4/9] block-backend: Enable retry action on errors Jiahui Cen
@ 2021-02-05 10:13 ` Jiahui Cen
  2021-02-05 10:13 ` [PATCH v5 6/9] block: Add error retry param setting Jiahui Cen
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Jiahui Cen @ 2021-02-05 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Retry should only be triggered when timeout is not reached, so let's check
timeout before retry. Device should also reset retry_start_time after
successful retry.

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 block/block-backend.c          | 25 +++++++++++++++++++-
 include/sysemu/block-backend.h |  1 +
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index ab75cb1971..8ad1e5105d 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1776,6 +1776,29 @@ void blk_drain_all(void)
     bdrv_drain_all_end();
 }
 
+static bool blk_error_retry_timeout(BlockBackend *blk)
+{
+    /* No timeout set, infinite retries. */
+    if (!blk->retry_timeout) {
+        return false;
+    }
+
+    /* The first time an error occurs. */
+    if (!blk->retry_start_time) {
+        blk->retry_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+        return false;
+    }
+
+    return qemu_clock_get_ms(QEMU_CLOCK_REALTIME) > (blk->retry_start_time +
+                                                     blk->retry_timeout);
+}
+
+void blk_error_retry_reset_timeout(BlockBackend *blk)
+{
+    if (blk->retry_timer && blk->retry_start_time)
+        blk->retry_start_time = 0;
+}
+
 void blk_set_on_error(BlockBackend *blk, BlockdevOnError on_read_error,
                       BlockdevOnError on_write_error)
 {
@@ -1804,7 +1827,7 @@ BlockErrorAction blk_get_error_action(BlockBackend *blk, bool is_read,
     case BLOCKDEV_ON_ERROR_IGNORE:
         return BLOCK_ERROR_ACTION_IGNORE;
     case BLOCKDEV_ON_ERROR_RETRY:
-        return (blk->retry_timer) ?
+        return (blk->retry_timer && !blk_error_retry_timeout(blk)) ?
                BLOCK_ERROR_ACTION_RETRY : BLOCK_ERROR_ACTION_REPORT;
     case BLOCKDEV_ON_ERROR_AUTO:
     default:
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index d806852db5..286d2db918 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -188,6 +188,7 @@ void blk_inc_in_flight(BlockBackend *blk);
 void blk_dec_in_flight(BlockBackend *blk);
 void blk_drain(BlockBackend *blk);
 void blk_drain_all(void);
+void blk_error_retry_reset_timeout(BlockBackend *blk);
 void blk_set_on_error(BlockBackend *blk, BlockdevOnError on_read_error,
                       BlockdevOnError on_write_error);
 BlockdevOnError blk_get_on_error(BlockBackend *blk, bool is_read);
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v5 6/9] block: Add error retry param setting
  2021-02-05 10:13 [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (4 preceding siblings ...)
  2021-02-05 10:13 ` [PATCH v5 5/9] block-backend: Add timeout support for retry Jiahui Cen
@ 2021-02-05 10:13 ` Jiahui Cen
  2021-02-22 17:16   ` Stefan Hajnoczi
  2021-02-05 10:13 ` [PATCH v5 7/9] virtio_blk: Add support for retry on errors Jiahui Cen
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Jiahui Cen @ 2021-02-05 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Add "retry_interval" and "retry_timeout" parameter for drive and device
option. These parameter are valid only when werror/rerror=retry.

eg. --drive file=image,rerror=retry,retry_interval=1000,retry_timeout=5000

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 block/block-backend.c          | 13 +++--
 blockdev.c                     | 50 ++++++++++++++++++++
 hw/block/block.c               | 10 ++++
 include/hw/block/block.h       |  7 ++-
 include/sysemu/block-backend.h |  5 ++
 5 files changed, 81 insertions(+), 4 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 8ad1e5105d..b97aba7110 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -35,9 +35,6 @@
 
 static AioContext *blk_aiocb_get_aio_context(BlockAIOCB *acb);
 
-/* block backend default retry interval */
-#define BLOCK_BACKEND_DEFAULT_RETRY_INTERVAL   1000
-
 typedef struct BlockBackendAioNotifier {
     void (*attached_aio_context)(AioContext *new_context, void *opaque);
     void (*detach_aio_context)(void *opaque);
@@ -1776,6 +1773,16 @@ void blk_drain_all(void)
     bdrv_drain_all_end();
 }
 
+void blk_set_on_error_retry_interval(BlockBackend *blk, int64_t interval)
+{
+    blk->retry_interval = interval;
+}
+
+void blk_set_on_error_retry_timeout(BlockBackend *blk, int64_t timeout)
+{
+    blk->retry_timeout = timeout;
+}
+
 static bool blk_error_retry_timeout(BlockBackend *blk)
 {
     /* No timeout set, infinite retries. */
diff --git a/blockdev.c b/blockdev.c
index ece1d8ae58..9b2cfdef78 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -489,6 +489,7 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
     const char *buf;
     int bdrv_flags = 0;
     int on_read_error, on_write_error;
+    int64_t retry_interval, retry_timeout;
     bool account_invalid, account_failed;
     bool writethrough, read_only;
     BlockBackend *blk;
@@ -581,6 +582,10 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
         }
     }
 
+    retry_interval = qemu_opt_get_number(opts, "retry_interval",
+                                         BLOCK_BACKEND_DEFAULT_RETRY_INTERVAL);
+    retry_timeout = qemu_opt_get_number(opts, "retry_timeout", 0);
+
     if (snapshot) {
         bdrv_flags |= BDRV_O_SNAPSHOT;
     }
@@ -645,6 +650,11 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
 
     blk_set_enable_write_cache(blk, !writethrough);
     blk_set_on_error(blk, on_read_error, on_write_error);
+    if (on_read_error == BLOCKDEV_ON_ERROR_RETRY ||
+        on_write_error == BLOCKDEV_ON_ERROR_RETRY) {
+        blk_set_on_error_retry_interval(blk, retry_interval);
+        blk_set_on_error_retry_timeout(blk, retry_timeout);
+    }
 
     if (!monitor_add_blk(blk, id, errp)) {
         blk_unref(blk);
@@ -771,6 +781,14 @@ QemuOptsList qemu_legacy_drive_opts = {
             .name = "werror",
             .type = QEMU_OPT_STRING,
             .help = "write error action",
+        },{
+            .name = "retry_interval",
+            .type = QEMU_OPT_NUMBER,
+            .help = "interval for retry action in millisecond",
+        },{
+            .name = "retry_timeout",
+            .type = QEMU_OPT_NUMBER,
+            .help = "timeout for retry action in millisecond",
         },{
             .name = "copy-on-read",
             .type = QEMU_OPT_BOOL,
@@ -793,6 +811,7 @@ DriveInfo *drive_new(QemuOpts *all_opts, BlockInterfaceType block_default_type,
     BlockInterfaceType type;
     int max_devs, bus_id, unit_id, index;
     const char *werror, *rerror;
+    int64_t retry_interval, retry_timeout;
     bool read_only = false;
     bool copy_on_read;
     const char *filename;
@@ -1004,6 +1023,29 @@ DriveInfo *drive_new(QemuOpts *all_opts, BlockInterfaceType block_default_type,
         qdict_put_str(bs_opts, "rerror", rerror);
     }
 
+    if (qemu_opt_find(legacy_opts, "retry_interval")) {
+        if ((werror == NULL || strcmp(werror, "retry")) &&
+            (rerror == NULL || strcmp(rerror, "retry"))) {
+            error_setg(errp, "retry_interval is only supported "
+                             "by werror/rerror=retry");
+            goto fail;
+        }
+        retry_interval = qemu_opt_get_number(legacy_opts, "retry_interval",
+                             BLOCK_BACKEND_DEFAULT_RETRY_INTERVAL);
+        qdict_put_int(bs_opts, "retry_interval", retry_interval);
+    }
+
+    if (qemu_opt_find(legacy_opts, "retry_timeout")) {
+        if ((werror == NULL || strcmp(werror, "retry")) &&
+            (rerror == NULL || strcmp(rerror, "retry"))) {
+            error_setg(errp, "retry_timeout is only supported "
+                             "by werror/rerror=retry");
+            goto fail;
+        }
+        retry_timeout = qemu_opt_get_number(legacy_opts, "retry_timeout", 0);
+        qdict_put_int(bs_opts, "retry_timeout", retry_timeout);
+    }
+
     /* Actual block device init: Functionality shared with blockdev-add */
     blk = blockdev_init(filename, bs_opts, errp);
     bs_opts = NULL;
@@ -3820,6 +3862,14 @@ QemuOptsList qemu_common_drive_opts = {
             .name = "werror",
             .type = QEMU_OPT_STRING,
             .help = "write error action",
+        },{
+            .name = "retry_interval",
+            .type = QEMU_OPT_NUMBER,
+            .help = "interval for retry action in millisecond",
+        },{
+            .name = "retry_timeout",
+            .type = QEMU_OPT_NUMBER,
+            .help = "timeout for retry action in millisecond",
         },{
             .name = BDRV_OPT_READ_ONLY,
             .type = QEMU_OPT_BOOL,
diff --git a/hw/block/block.c b/hw/block/block.c
index 1e34573da7..d2f35dc465 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -172,6 +172,16 @@ bool blkconf_apply_backend_options(BlockConf *conf, bool readonly,
     blk_set_enable_write_cache(blk, wce);
     blk_set_on_error(blk, rerror, werror);
 
+    if (rerror == BLOCKDEV_ON_ERROR_RETRY ||
+        werror == BLOCKDEV_ON_ERROR_RETRY) {
+        if (conf->retry_interval >= 0) {
+            blk_set_on_error_retry_interval(blk, conf->retry_interval);
+        }
+        if (conf->retry_timeout >= 0) {
+            blk_set_on_error_retry_timeout(blk, conf->retry_timeout);
+        }
+    }
+
     return true;
 }
 
diff --git a/include/hw/block/block.h b/include/hw/block/block.h
index c172cbe65f..a8e77fae68 100644
--- a/include/hw/block/block.h
+++ b/include/hw/block/block.h
@@ -32,6 +32,8 @@ typedef struct BlockConf {
     bool share_rw;
     BlockdevOnError rerror;
     BlockdevOnError werror;
+    int64_t retry_interval;
+    int64_t retry_timeout;
 } BlockConf;
 
 static inline unsigned int get_physical_block_exp(BlockConf *conf)
@@ -76,7 +78,10 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf)
     DEFINE_PROP_BLOCKDEV_ON_ERROR("rerror", _state, _conf.rerror,       \
                                   BLOCKDEV_ON_ERROR_AUTO),              \
     DEFINE_PROP_BLOCKDEV_ON_ERROR("werror", _state, _conf.werror,       \
-                                  BLOCKDEV_ON_ERROR_AUTO)
+                                  BLOCKDEV_ON_ERROR_AUTO),              \
+    DEFINE_PROP_INT64("retry_interval", _state, _conf.retry_interval,   \
+                      -1),                                              \
+    DEFINE_PROP_INT64("retry_timeout", _state, _conf.retry_timeout, -1)
 
 /* Backend access helpers */
 
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 286d2db918..aecfa25b33 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -25,6 +25,9 @@
  */
 #include "block/block.h"
 
+/* block backend default retry interval */
+#define BLOCK_BACKEND_DEFAULT_RETRY_INTERVAL   1000
+
 /* Callbacks for block device models */
 typedef struct BlockDevOps {
     /*
@@ -188,6 +191,8 @@ void blk_inc_in_flight(BlockBackend *blk);
 void blk_dec_in_flight(BlockBackend *blk);
 void blk_drain(BlockBackend *blk);
 void blk_drain_all(void);
+void blk_set_on_error_retry_interval(BlockBackend *blk, int64_t interval);
+void blk_set_on_error_retry_timeout(BlockBackend *blk, int64_t timeout);
 void blk_error_retry_reset_timeout(BlockBackend *blk);
 void blk_set_on_error(BlockBackend *blk, BlockdevOnError on_read_error,
                       BlockdevOnError on_write_error);
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v5 7/9] virtio_blk: Add support for retry on errors
  2021-02-05 10:13 [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (5 preceding siblings ...)
  2021-02-05 10:13 ` [PATCH v5 6/9] block: Add error retry param setting Jiahui Cen
@ 2021-02-05 10:13 ` Jiahui Cen
  2021-02-05 10:13 ` [PATCH v5 8/9] scsi-bus: Refactor the code that retries requests Jiahui Cen
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Jiahui Cen @ 2021-02-05 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Insert failed requests into device's list for later retry and handle
queued requests to implement retry_request_cb.

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 hw/block/virtio-blk.c | 21 +++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index e8600b069d..58f098fb9c 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -108,6 +108,10 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
             block_acct_failed(blk_get_stats(s->blk), &req->acct);
         }
         virtio_blk_free_request(req);
+    } else if (action == BLOCK_ERROR_ACTION_RETRY) {
+        req->mr_next = NULL;
+        req->next = s->rq;
+        s->rq = req;
     }
 
     blk_error_action(s->blk, action, is_read, error);
@@ -149,6 +153,7 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
             }
         }
 
+        blk_error_retry_reset_timeout(s->blk);
         virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
         block_acct_done(blk_get_stats(s->blk), &req->acct);
         virtio_blk_free_request(req);
@@ -168,6 +173,7 @@ static void virtio_blk_flush_complete(void *opaque, int ret)
         }
     }
 
+    blk_error_retry_reset_timeout(s->blk);
     virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
     block_acct_done(blk_get_stats(s->blk), &req->acct);
     virtio_blk_free_request(req);
@@ -190,6 +196,7 @@ static void virtio_blk_discard_write_zeroes_complete(void *opaque, int ret)
         }
     }
 
+    blk_error_retry_reset_timeout(s->blk);
     virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
     if (is_write_zeroes) {
         block_acct_done(blk_get_stats(s->blk), &req->acct);
@@ -828,12 +835,12 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 
 void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh)
 {
-    VirtIOBlockReq *req = s->rq;
+    VirtIOBlockReq *req;
     MultiReqBuffer mrb = {};
 
-    s->rq = NULL;
-
     aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
+    req = s->rq;
+    s->rq = NULL;
     while (req) {
         VirtIOBlockReq *next = req->next;
         if (virtio_blk_handle_request(req, &mrb)) {
@@ -1134,8 +1141,16 @@ static void virtio_blk_resize(void *opaque)
     aio_bh_schedule_oneshot(qemu_get_aio_context(), virtio_resize_cb, vdev);
 }
 
+static void virtio_blk_retry_request(void *opaque)
+{
+    VirtIOBlock *s = VIRTIO_BLK(opaque);
+
+    virtio_blk_process_queued_requests(s, false);
+}
+
 static const BlockDevOps virtio_block_ops = {
     .resize_cb = virtio_blk_resize,
+    .retry_request_cb = virtio_blk_retry_request,
 };
 
 static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v5 8/9] scsi-bus: Refactor the code that retries requests
  2021-02-05 10:13 [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (6 preceding siblings ...)
  2021-02-05 10:13 ` [PATCH v5 7/9] virtio_blk: Add support for retry on errors Jiahui Cen
@ 2021-02-05 10:13 ` Jiahui Cen
  2021-02-05 10:13 ` [PATCH v5 9/9] scsi-disk: Add support for retry on errors Jiahui Cen
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Jiahui Cen @ 2021-02-05 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Move the code that retries requests from scsi_dma_restart_bh() to its own,
non-static, function. This will allow us to call it from the
retry_request_cb() of scsi-disk in a future patch.

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 hw/scsi/scsi-bus.c     | 16 +++++++++++-----
 include/hw/scsi/scsi.h |  1 +
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index c349fb7f2d..b2a174efe2 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -143,14 +143,10 @@ void scsi_bus_new(SCSIBus *bus, size_t bus_size, DeviceState *host,
     qbus_set_bus_hotplug_handler(BUS(bus));
 }
 
-static void scsi_dma_restart_bh(void *opaque)
+void scsi_retry_requests(SCSIDevice *s)
 {
-    SCSIDevice *s = opaque;
     SCSIRequest *req, *next;
 
-    qemu_bh_delete(s->bh);
-    s->bh = NULL;
-
     aio_context_acquire(blk_get_aio_context(s->conf.blk));
     QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
         scsi_req_ref(req);
@@ -170,6 +166,16 @@ static void scsi_dma_restart_bh(void *opaque)
         scsi_req_unref(req);
     }
     aio_context_release(blk_get_aio_context(s->conf.blk));
+}
+
+static void scsi_dma_restart_bh(void *opaque)
+{
+    SCSIDevice *s = opaque;
+
+    qemu_bh_delete(s->bh);
+    s->bh = NULL;
+
+    scsi_retry_requests(s);
     /* Drop the reference that was acquired in scsi_dma_restart_cb */
     object_unref(OBJECT(s));
 }
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 09fa5c9d2a..28f330deaf 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -181,6 +181,7 @@ void scsi_req_cancel_complete(SCSIRequest *req);
 void scsi_req_cancel(SCSIRequest *req);
 void scsi_req_cancel_async(SCSIRequest *req, Notifier *notifier);
 void scsi_req_retry(SCSIRequest *req);
+void scsi_retry_requests(SCSIDevice *s);
 void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense);
 void scsi_device_set_ua(SCSIDevice *sdev, SCSISense sense);
 void scsi_device_report_change(SCSIDevice *dev, SCSISense sense);
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v5 9/9] scsi-disk: Add support for retry on errors
  2021-02-05 10:13 [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (7 preceding siblings ...)
  2021-02-05 10:13 ` [PATCH v5 8/9] scsi-bus: Refactor the code that retries requests Jiahui Cen
@ 2021-02-05 10:13 ` Jiahui Cen
  2021-02-10  1:22 ` [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Jiahui Cen @ 2021-02-05 10:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, cenjiahui, zhang.zhanghailiang, qemu-block,
	Michael S. Tsirkin, Markus Armbruster, Max Reitz,
	Stefan Hajnoczi, fangying1, John Snow

Mark failed requests as to be retried and implement retry_request_cb to
handle these requests.

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>
---
 hw/scsi/scsi-disk.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index ed52fcd49f..687ed19822 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -184,6 +184,8 @@ static void scsi_disk_load_request(QEMUFile *f, SCSIRequest *req)
 
 static bool scsi_disk_req_check_error(SCSIDiskReq *r, int ret, bool acct_failed)
 {
+    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
+
     if (r->req.io_canceled) {
         scsi_req_cancel_complete(&r->req);
         return true;
@@ -193,6 +195,7 @@ static bool scsi_disk_req_check_error(SCSIDiskReq *r, int ret, bool acct_failed)
         return scsi_handle_rw_error(r, -ret, acct_failed);
     }
 
+    blk_error_retry_reset_timeout(s->qdev.conf.blk);
     return false;
 }
 
@@ -499,6 +502,10 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int error, bool acct_failed)
         }
     }
 
+    if (action == BLOCK_ERROR_ACTION_RETRY) {
+        scsi_req_retry(&r->req);
+    }
+
     blk_error_action(s->qdev.conf.blk, action, is_read, error);
     if (action == BLOCK_ERROR_ACTION_IGNORE) {
         scsi_req_complete(&r->req, 0);
@@ -2284,6 +2291,13 @@ static void scsi_disk_resize_cb(void *opaque)
     }
 }
 
+static void scsi_disk_retry_request(void *opaque)
+{
+    SCSIDiskState *s = opaque;
+
+    scsi_retry_requests(&s->qdev);
+}
+
 static void scsi_cd_change_media_cb(void *opaque, bool load, Error **errp)
 {
     SCSIDiskState *s = opaque;
@@ -2332,10 +2346,12 @@ static const BlockDevOps scsi_disk_removable_block_ops = {
     .is_medium_locked = scsi_cd_is_medium_locked,
 
     .resize_cb = scsi_disk_resize_cb,
+    .retry_request_cb = scsi_disk_retry_request,
 };
 
 static const BlockDevOps scsi_disk_block_ops = {
     .resize_cb = scsi_disk_resize_cb,
+    .retry_request_cb = scsi_disk_retry_request,
 };
 
 static void scsi_disk_unit_attention_reported(SCSIDevice *dev)
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism
  2021-02-05 10:13 [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (8 preceding siblings ...)
  2021-02-05 10:13 ` [PATCH v5 9/9] scsi-disk: Add support for retry on errors Jiahui Cen
@ 2021-02-10  1:22 ` Jiahui Cen
  2021-02-22 11:00   ` Stefan Hajnoczi
  2021-02-22 17:25 ` Stefan Hajnoczi
  2021-02-23  9:40 ` Stefan Hajnoczi
  11 siblings, 1 reply; 21+ messages in thread
From: Jiahui Cen @ 2021-02-10  1:22 UTC (permalink / raw)
  To: qemu-devel, Kevin Wolf, Stefan Hajnoczi
  Cc: zhang.zhanghailiang, qemu-block, Michael S. Tsirkin, John Snow,
	Markus Armbruster, Max Reitz, fangying1

Kindly ping.
Any comments and reviews are wellcome :)

Thanks,
Jiahui

On 2021/2/5 18:13, Jiahui Cen wrote:
> A VM in the cloud environment may use a virutal disk as the backend storage,
> and there are usually filesystems on the virtual block device. When backend
> storage is temporarily down, any I/O issued to the virtual block device
> will cause an error. For example, an error occurred in ext4 filesystem would
> make the filesystem readonly. In production environment, a cloud backend
> storage can be soon recovered. For example, an IP-SAN may be down due to
> network failure and will be online soon after network is recovered. However,
> the error in the filesystem may not be recovered unless a device reattach
> or system restart. Thus an I/O retry mechanism is in need to implement a
> self-healing system.
> 
> This patch series propose to extend the werror=/rerror= mechanism to add
> a 'retry' feature. It can automatically retry failed I/O requests on error
> without sending error back to guest, and guest can get back running smoothly
> when I/O is recovred.
> 
> v4->v5:
> * Add document for 'retry' in qapi.
> * Support werror=/rerror=retry for scsi-disk.
> * Pause retry when draining.
> 
> v3->v4:
> * Adapt to werror=/rerror= mechanism.
> 
> v2->v3:
> * Add a doc to describe I/O hang.
> 
> v1->v2:
> * Rebase to fix compile problems.
> * Fix incorrect remove of rehandle list.
> * Provide rehandle pause interface.
> 
> REF: https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg06560.html
> 
> Jiahui Cen (9):
>   qapi/block-core: Add retry option for error action
>   block-backend: Introduce retry timer
>   block-backend: Add device specific retry callback
>   block-backend: Enable retry action on errors
>   block-backend: Add timeout support for retry
>   block: Add error retry param setting
>   virtio_blk: Add support for retry on errors
>   scsi-bus: Refactor the code that retries requests
>   scsi-disk: Add support for retry on errors
> 
>  block/block-backend.c          | 68 ++++++++++++++++++++
>  blockdev.c                     | 52 +++++++++++++++
>  hw/block/block.c               | 10 +++
>  hw/block/virtio-blk.c          | 21 +++++-
>  hw/scsi/scsi-bus.c             | 16 +++--
>  hw/scsi/scsi-disk.c            | 16 +++++
>  include/hw/block/block.h       |  7 +-
>  include/hw/scsi/scsi.h         |  1 +
>  include/sysemu/block-backend.h | 10 +++
>  qapi/block-core.json           |  9 ++-
>  10 files changed, 199 insertions(+), 11 deletions(-)
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism
  2021-02-10  1:22 ` [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
@ 2021-02-22 11:00   ` Stefan Hajnoczi
  0 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-02-22 11:00 UTC (permalink / raw)
  To: Jiahui Cen
  Cc: Kevin Wolf, zhang.zhanghailiang, qemu-block, Michael S. Tsirkin,
	Markus Armbruster, qemu-devel, fangying1, Max Reitz, John Snow

[-- Attachment #1: Type: text/plain, Size: 286 bytes --]

On Wed, Feb 10, 2021 at 09:22:38AM +0800, Jiahui Cen wrote:
> Kindly ping.
> Any comments and reviews are wellcome :)

I lost track of this series. Sorry!

Feel free to ping me on #qemu IRC (my nick is "stefanha") if I'm not
responding to emails.

I will review the series now.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 4/9] block-backend: Enable retry action on errors
  2021-02-05 10:13 ` [PATCH v5 4/9] block-backend: Enable retry action on errors Jiahui Cen
@ 2021-02-22 17:12   ` Stefan Hajnoczi
  0 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-02-22 17:12 UTC (permalink / raw)
  To: Jiahui Cen
  Cc: Kevin Wolf, zhang.zhanghailiang, qemu-block, Michael S. Tsirkin,
	Markus Armbruster, qemu-devel, fangying1, Max Reitz, John Snow

[-- Attachment #1: Type: text/plain, Size: 1702 bytes --]

On Fri, Feb 05, 2021 at 06:13:10PM +0800, Jiahui Cen wrote:
> Enable retry action when backend's retry timer is available. It would
> trigger the timer to do device specific retry action.
> 
> Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
> Signed-off-by: Ying Fang <fangying1@huawei.com>
> ---
>  block/block-backend.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index a8bfaf6e4a..ab75cb1971 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1803,6 +1803,9 @@ BlockErrorAction blk_get_error_action(BlockBackend *blk, bool is_read,
>          return BLOCK_ERROR_ACTION_REPORT;
>      case BLOCKDEV_ON_ERROR_IGNORE:
>          return BLOCK_ERROR_ACTION_IGNORE;
> +    case BLOCKDEV_ON_ERROR_RETRY:
> +        return (blk->retry_timer) ?
> +               BLOCK_ERROR_ACTION_RETRY : BLOCK_ERROR_ACTION_REPORT;
>      case BLOCKDEV_ON_ERROR_AUTO:
>      default:
>          abort();
> @@ -1850,6 +1853,12 @@ void blk_error_action(BlockBackend *blk, BlockErrorAction action,
>          qemu_system_vmstop_request_prepare();
>          send_qmp_error_event(blk, action, is_read, error);
>          qemu_system_vmstop_request(RUN_STATE_IO_ERROR);
> +    } else if (action == BLOCK_ERROR_ACTION_RETRY) {
> +        if (!blk->quiesce_counter) {
> +            timer_mod(blk->retry_timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) +
> +                                        blk->retry_interval);

QEMU should not make any guest-visible changes while the guest is
stopped (vm_stop()). Please use QEMU_CLOCK_VIRTUAL so the timer does not
fire while the guest is stopped.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 6/9] block: Add error retry param setting
  2021-02-05 10:13 ` [PATCH v5 6/9] block: Add error retry param setting Jiahui Cen
@ 2021-02-22 17:16   ` Stefan Hajnoczi
  0 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-02-22 17:16 UTC (permalink / raw)
  To: Jiahui Cen
  Cc: Kevin Wolf, zhang.zhanghailiang, qemu-block, Michael S. Tsirkin,
	Markus Armbruster, qemu-devel, fangying1, Max Reitz, John Snow

[-- Attachment #1: Type: text/plain, Size: 521 bytes --]

On Fri, Feb 05, 2021 at 06:13:12PM +0800, Jiahui Cen wrote:
> @@ -581,6 +582,10 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
>          }
>      }
>  
> +    retry_interval = qemu_opt_get_number(opts, "retry_interval",
> +                                         BLOCK_BACKEND_DEFAULT_RETRY_INTERVAL);
> +    retry_timeout = qemu_opt_get_number(opts, "retry_timeout", 0);

Please use "retry-interval" and "retry-timeout". "-" is used more often
for command-line parameters than "_".

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 1/9] qapi/block-core: Add retry option for error action
  2021-02-05 10:13 ` [PATCH v5 1/9] qapi/block-core: Add retry option for error action Jiahui Cen
@ 2021-02-22 17:22   ` Stefan Hajnoczi
  0 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-02-22 17:22 UTC (permalink / raw)
  To: Jiahui Cen
  Cc: Kevin Wolf, zhang.zhanghailiang, qemu-block, Michael S. Tsirkin,
	Markus Armbruster, qemu-devel, fangying1, Max Reitz, John Snow

[-- Attachment #1: Type: text/plain, Size: 1538 bytes --]

On Fri, Feb 05, 2021 at 06:13:07PM +0800, Jiahui Cen wrote:
> Add a new error action 'retry' to support retry on errors.
> 
> Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
> Signed-off-by: Ying Fang <fangying1@huawei.com>
> ---
>  blockdev.c           | 2 ++
>  qapi/block-core.json | 9 +++++++--
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/blockdev.c b/blockdev.c
> index b250b9b959..ece1d8ae58 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -342,6 +342,8 @@ static int parse_block_error_action(const char *buf, bool is_read, Error **errp)
>          return BLOCKDEV_ON_ERROR_STOP;
>      } else if (!strcmp(buf, "report")) {
>          return BLOCKDEV_ON_ERROR_REPORT;
> +    } else if (!strcmp(buf, "retry")) {
> +        return BLOCKDEV_ON_ERROR_RETRY;
>      } else {
>          error_setg(errp, "'%s' invalid %s error action",
>                     buf, is_read ? "read" : "write");
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 9f555d5c1d..30ea43cb77 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -1143,10 +1143,13 @@
>  #
>  # @auto: inherit the error handling policy of the backend (since: 2.7)
>  #
> +# @retry: for guest operations, retry the failing request; (since: 6.0)
> +#         for jobs, not supported

Does this mean block_job_error_action() can now reach abort() in switch
(on_err)? If yes, please add a check that reports an error when "retry"
is specified so that abort() cannot be reached.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 3/9] block-backend: Add device specific retry callback
  2021-02-05 10:13 ` [PATCH v5 3/9] block-backend: Add device specific retry callback Jiahui Cen
@ 2021-02-22 17:24   ` Stefan Hajnoczi
  0 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-02-22 17:24 UTC (permalink / raw)
  To: Jiahui Cen
  Cc: Kevin Wolf, zhang.zhanghailiang, qemu-block, Michael S. Tsirkin,
	Markus Armbruster, qemu-devel, fangying1, Max Reitz, John Snow

[-- Attachment #1: Type: text/plain, Size: 1441 bytes --]

On Fri, Feb 05, 2021 at 06:13:09PM +0800, Jiahui Cen wrote:
> Add retry_request_cb in BlockDevOps to do device specific retry action.
> Backend's timer would be registered only when the backend is set 'retry'
> on errors and the device supports retry action.
> 
> Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
> Signed-off-by: Ying Fang <fangying1@huawei.com>
> ---
>  block/block-backend.c          | 8 ++++++++
>  include/sysemu/block-backend.h | 4 ++++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 3a9d55cbe3..a8bfaf6e4a 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -995,6 +995,14 @@ void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops,
>      blk->dev_ops = ops;
>      blk->dev_opaque = opaque;
>  
> +    if ((blk->on_read_error == BLOCKDEV_ON_ERROR_RETRY ||
> +         blk->on_write_error == BLOCKDEV_ON_ERROR_RETRY) &&
> +        ops->retry_request_cb) {
> +        blk->retry_timer = aio_timer_new(blk->ctx, QEMU_CLOCK_REALTIME,
> +                                         SCALE_MS, ops->retry_request_cb,
> +                                         opaque);
> +    }

I didn't see anything that handles AioContext changes. If the
BlockBackend is detached from the current AioContext and attached to a
different one, then it is necessary to delete the old timer and create
a new one.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism
  2021-02-05 10:13 [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (9 preceding siblings ...)
  2021-02-10  1:22 ` [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
@ 2021-02-22 17:25 ` Stefan Hajnoczi
  2021-02-23  9:40 ` Stefan Hajnoczi
  11 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-02-22 17:25 UTC (permalink / raw)
  To: Jiahui Cen
  Cc: Kevin Wolf, zhang.zhanghailiang, qemu-block, Michael S. Tsirkin,
	Markus Armbruster, qemu-devel, fangying1, Max Reitz, John Snow

[-- Attachment #1: Type: text/plain, Size: 391 bytes --]

On Fri, Feb 05, 2021 at 06:13:06PM +0800, Jiahui Cen wrote:
> This patch series propose to extend the werror=/rerror= mechanism to add
> a 'retry' feature. It can automatically retry failed I/O requests on error
> without sending error back to guest, and guest can get back running smoothly
> when I/O is recovred.

I posted a few questions but overall this looks promising. Thanks!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism
  2021-02-05 10:13 [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
                   ` (10 preceding siblings ...)
  2021-02-22 17:25 ` Stefan Hajnoczi
@ 2021-02-23  9:40 ` Stefan Hajnoczi
  2021-02-23 10:20   ` Jiahui Cen
  2021-02-23 13:41   ` Eric Blake
  11 siblings, 2 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2021-02-23  9:40 UTC (permalink / raw)
  To: Jiahui Cen
  Cc: Kevin Wolf, zhang.zhanghailiang, qemu-block, Michael S. Tsirkin,
	Markus Armbruster, qemu-devel, fangying1, Max Reitz, John Snow

[-- Attachment #1: Type: text/plain, Size: 1292 bytes --]

On Fri, Feb 05, 2021 at 06:13:06PM +0800, Jiahui Cen wrote:
> This patch series propose to extend the werror=/rerror= mechanism to add
> a 'retry' feature. It can automatically retry failed I/O requests on error
> without sending error back to guest, and guest can get back running smoothly
> when I/O is recovred.

This patch series implements a retry followed by werror/rerror=report
after a timeout. This mechanism could be made more generic (and the code
could be simplified) by removing the new werror/rerror=retry action and
instead implementing the retry/timeout followed by *any* werror=/rerror=
policy chosen by the user.

In other words, if the retry interval is non-zero, retry the request and
check for timeouts. When the timeout is reached, obey the
werror=/rerror= action.

This is more flexible than hard-coding werror=retry to mean retry
timeout followed by werror=report.

For example:

  werror=stop,write-retry-interval=1000,write-retry-timeout=15000,
  rerror=report,read-retry-interval=1000,read-retry-timeout=15000

Failed write requests will be retried once a second for 15 seconds.
If the timeout is reached the guest is stopped.

Failed read requests will be retried once a second for 15 seconds. If
the timeout is reached the error is reported to the guest.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism
  2021-02-23  9:40 ` Stefan Hajnoczi
@ 2021-02-23 10:20   ` Jiahui Cen
  2021-02-23 13:41   ` Eric Blake
  1 sibling, 0 replies; 21+ messages in thread
From: Jiahui Cen @ 2021-02-23 10:20 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, zhang.zhanghailiang, qemu-block, Michael S. Tsirkin,
	Markus Armbruster, qemu-devel, fangying1, Max Reitz, John Snow

Hi Stefan,

On 2021/2/23 17:40, Stefan Hajnoczi wrote:
> On Fri, Feb 05, 2021 at 06:13:06PM +0800, Jiahui Cen wrote:
>> This patch series propose to extend the werror=/rerror= mechanism to add
>> a 'retry' feature. It can automatically retry failed I/O requests on error
>> without sending error back to guest, and guest can get back running smoothly
>> when I/O is recovred.
> 
> This patch series implements a retry followed by werror/rerror=report
> after a timeout. This mechanism could be made more generic (and the code
> could be simplified) by removing the new werror/rerror=retry action and
> instead implementing the retry/timeout followed by *any* werror=/rerror=
> policy chosen by the user.
> 
> In other words, if the retry interval is non-zero, retry the request and
> check for timeouts. When the timeout is reached, obey the
> werror=/rerror= action.
> 
> This is more flexible than hard-coding werror=retry to mean retry
> timeout followed by werror=report.
> 
> For example:
> 
>   werror=stop,write-retry-interval=1000,write-retry-timeout=15000,
>   rerror=report,read-retry-interval=1000,read-retry-timeout=15000
> 
> Failed write requests will be retried once a second for 15 seconds.
> If the timeout is reached the guest is stopped.
> 
> Failed read requests will be retried once a second for 15 seconds. If
> the timeout is reached the error is reported to the guest.

Sounds like a better way for me. I'll refactor this patch series under
your suggestion.

Also thanks for your review.

Thanks,
Jiahui


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism
  2021-02-23  9:40 ` Stefan Hajnoczi
  2021-02-23 10:20   ` Jiahui Cen
@ 2021-02-23 13:41   ` Eric Blake
  2021-02-25 17:24     ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Blake @ 2021-02-23 13:41 UTC (permalink / raw)
  To: Stefan Hajnoczi, Jiahui Cen
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, zhang.zhanghailiang,
	qemu-block, Michael S. Tsirkin, Markus Armbruster, qemu-devel,
	fangying1, Max Reitz, John Snow

On 2/23/21 3:40 AM, Stefan Hajnoczi wrote:
> On Fri, Feb 05, 2021 at 06:13:06PM +0800, Jiahui Cen wrote:
>> This patch series propose to extend the werror=/rerror= mechanism to add
>> a 'retry' feature. It can automatically retry failed I/O requests on error
>> without sending error back to guest, and guest can get back running smoothly
>> when I/O is recovred.
> 
> This patch series implements a retry followed by werror/rerror=report
> after a timeout. This mechanism could be made more generic (and the code
> could be simplified) by removing the new werror/rerror=retry action and
> instead implementing the retry/timeout followed by *any* werror=/rerror=
> policy chosen by the user.
> 
> In other words, if the retry interval is non-zero, retry the request and
> check for timeouts. When the timeout is reached, obey the
> werror=/rerror= action.
> 
> This is more flexible than hard-coding werror=retry to mean retry
> timeout followed by werror=report.
> 
> For example:
> 
>   werror=stop,write-retry-interval=1000,write-retry-timeout=15000,
>   rerror=report,read-retry-interval=1000,read-retry-timeout=15000
> 
> Failed write requests will be retried once a second for 15 seconds.
> If the timeout is reached the guest is stopped.
> 
> Failed read requests will be retried once a second for 15 seconds. If
> the timeout is reached the error is reported to the guest.

You may also want to look at what the NBD block device already
implements for retries, and see if making retry generic to the block
layer in general can do everything already possible in the NBD code, at
which point the NBD code can be simplified.  Vladimir (added in cc) is
the best point of contact there.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism
  2021-02-23 13:41   ` Eric Blake
@ 2021-02-25 17:24     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-02-25 17:24 UTC (permalink / raw)
  To: Eric Blake, Stefan Hajnoczi, Jiahui Cen
  Cc: qemu-devel, qemu-block, Kevin Wolf, Max Reitz, Markus Armbruster,
	John Snow, Michael S. Tsirkin, zhang.zhanghailiang, fangying1

23.02.2021 16:41, Eric Blake wrote:
> On 2/23/21 3:40 AM, Stefan Hajnoczi wrote:
>> On Fri, Feb 05, 2021 at 06:13:06PM +0800, Jiahui Cen wrote:
>>> This patch series propose to extend the werror=/rerror= mechanism to add
>>> a 'retry' feature. It can automatically retry failed I/O requests on error
>>> without sending error back to guest, and guest can get back running smoothly
>>> when I/O is recovred.
>>
>> This patch series implements a retry followed by werror/rerror=report
>> after a timeout. This mechanism could be made more generic (and the code
>> could be simplified) by removing the new werror/rerror=retry action and
>> instead implementing the retry/timeout followed by *any* werror=/rerror=
>> policy chosen by the user.
>>
>> In other words, if the retry interval is non-zero, retry the request and
>> check for timeouts. When the timeout is reached, obey the
>> werror=/rerror= action.
>>
>> This is more flexible than hard-coding werror=retry to mean retry
>> timeout followed by werror=report.
>>
>> For example:
>>
>>    werror=stop,write-retry-interval=1000,write-retry-timeout=15000,
>>    rerror=report,read-retry-interval=1000,read-retry-timeout=15000
>>
>> Failed write requests will be retried once a second for 15 seconds.
>> If the timeout is reached the guest is stopped.
>>
>> Failed read requests will be retried once a second for 15 seconds. If
>> the timeout is reached the error is reported to the guest.
> 
> You may also want to look at what the NBD block device already
> implements for retries, and see if making retry generic to the block
> layer in general can do everything already possible in the NBD code, at
> which point the NBD code can be simplified.  Vladimir (added in cc) is
> the best point of contact there.
> 

Hi!

There are some specific things for retrying in NBD client code, that may not make sense for generic code:

We try to handle not some generic error but tcp connection problems. So

1. We NBD retries we wanted to retry only requests failed due to connection loss.
    So, if NBD server successfully reply with -EIO to client request, we don't retry and just report -EIO.
    Such things can be distinguished only inside NBD driver.

2. Timeout is specific too: we start timer when we detected the connection lost for the first time.
    During the specified time all current requests are waiting for reconnect, they'll be retried if
    connection established.
    On timeout all current requests (waiting for reconnect) will report failure, as well as all further
    requests will report failure immediately (until successful reconnect).

Still, of course, NBD driver is over-complicated by reconnect feature (especially with reconnect thread:) and
it would be cool to split some parts to be separate...


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-02-25 17:26 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-05 10:13 [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
2021-02-05 10:13 ` [PATCH v5 1/9] qapi/block-core: Add retry option for error action Jiahui Cen
2021-02-22 17:22   ` Stefan Hajnoczi
2021-02-05 10:13 ` [PATCH v5 2/9] block-backend: Introduce retry timer Jiahui Cen
2021-02-05 10:13 ` [PATCH v5 3/9] block-backend: Add device specific retry callback Jiahui Cen
2021-02-22 17:24   ` Stefan Hajnoczi
2021-02-05 10:13 ` [PATCH v5 4/9] block-backend: Enable retry action on errors Jiahui Cen
2021-02-22 17:12   ` Stefan Hajnoczi
2021-02-05 10:13 ` [PATCH v5 5/9] block-backend: Add timeout support for retry Jiahui Cen
2021-02-05 10:13 ` [PATCH v5 6/9] block: Add error retry param setting Jiahui Cen
2021-02-22 17:16   ` Stefan Hajnoczi
2021-02-05 10:13 ` [PATCH v5 7/9] virtio_blk: Add support for retry on errors Jiahui Cen
2021-02-05 10:13 ` [PATCH v5 8/9] scsi-bus: Refactor the code that retries requests Jiahui Cen
2021-02-05 10:13 ` [PATCH v5 9/9] scsi-disk: Add support for retry on errors Jiahui Cen
2021-02-10  1:22 ` [PATCH v5 0/9] block: Add retry for werror=/rerror= mechanism Jiahui Cen
2021-02-22 11:00   ` Stefan Hajnoczi
2021-02-22 17:25 ` Stefan Hajnoczi
2021-02-23  9:40 ` Stefan Hajnoczi
2021-02-23 10:20   ` Jiahui Cen
2021-02-23 13:41   ` Eric Blake
2021-02-25 17:24     ` Vladimir Sementsov-Ogievskiy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.