qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [RFC 0/4] POC: Generating realistic block errors
@ 2019-09-19 19:48 Tony Asleson
  2019-09-19 19:48 ` [RFC 1/4] Add qapi for block error injection Tony Asleson
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Tony Asleson @ 2019-09-19 19:48 UTC (permalink / raw)
  To: qemu-devel, kwolf

For a long time I thought that VMs could be a great way to improve OS
code quality by modifying the simulated hardware to return errors to
exercise error paths in the OSs, specifically in block devices for right now.
A number of different approaches are available within the Linux kernel, eg.
scsi-debug, dm-flakey, and others.  However, I always though it would be best to
simulate it from the hardware.  To fully exercise the entire stack.  As a
bonus it's OS agnostic for those projects that cross OSs and it's available
before the OS even boots.

This POC needs a lot more work, but it's what I have so far.  Learning QEMU
internals, plus some of the different bus types has been interesting to say
the least.  I'm most familiar with SCSI, but the others are foreign to me.
AHCI/SATA/ATA is very interesting with it's history and the associated code to
handle it's evolution.

Eventually I think it would be useful to add functionality for errors on
write paths, timeouts, and different error behaviors.  Like media error(s) that
are corrected by a re-write (simulate failed write on power loss), bit rot
injection etc.  I know a number of these can be solved different ways,
but embracing them from the VM environment seems ideal to me.  Expanding
to gather statistics on IO patterns, histograms etc. could be very
useful too.  Having the ability to start/stop information collection and
then have access to what happened and in what order could allow for
better error injection of key FS structures and software RAID solutions too.

I've recently been informed that blkdebug exists.  From a cursory investigation
it does appear have overlap, but I haven't given it a try yet.  From looking
at the code to insert my changes it appears that some of the errors I'm
generating are different than what for example an EIO on a read_aio does, but
I'm not sure.  Perhaps some of the other features that I've outlined above
already exist too in some other capacity of QEMU?

Instead of working on this more in a vacuum I'm presenting what I have.  To
gauge interest, to see if others think it's as interesting as I do.  Or perhaps,
to find out that I've been re-inventing the wheel.

I'm interested in learning what thoughts people have on this.

Thanks,
Tony

Tony Asleson (4):
  Add qapi for block error injection
  SCSI media error reporting
  NVMe media error reporting
  ahci media error reporting

 block/Makefile.objs  |   2 +-
 block/error_inject.c | 179 +++++++++++++++++++++++++++++++++++++++++++
 block/error_inject.h |  43 +++++++++++
 block/qapi.c         |  18 +++++
 hw/block/nvme.c      |   8 ++
 hw/ide/ahci.c        |  27 +++++++
 hw/scsi/scsi-disk.c  |  33 ++++++++
 include/scsi/utils.h |   4 +
 qapi/block.json      |  36 +++++++++
 scsi/utils.c         |  31 ++++++++
 10 files changed, 380 insertions(+), 1 deletion(-)
 create mode 100644 block/error_inject.c
 create mode 100644 block/error_inject.h

-- 
2.21.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC 1/4] Add qapi for block error injection
  2019-09-19 19:48 [RFC 0/4] POC: Generating realistic block errors Tony Asleson
@ 2019-09-19 19:48 ` Tony Asleson
  2019-09-20  9:03   ` Philippe Mathieu-Daudé
  2019-09-19 19:48 ` [RFC 2/4] SCSI media error reporting Tony Asleson
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Tony Asleson @ 2019-09-19 19:48 UTC (permalink / raw)
  To: qemu-devel, kwolf

Proof of concept code to dynamically create/delete media errors
for block devices using the qapi interface.  This is useful for testing
the OS and the block and FS layers for error handling.  Utilizing this
in a VM allows it to be OS agnostic and test the OS at it's lowest levels
of hardware interaction.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
---
 block/Makefile.objs  |   2 +-
 block/error_inject.c | 179 +++++++++++++++++++++++++++++++++++++++++++
 block/error_inject.h |  43 +++++++++++
 block/qapi.c         |  18 +++++
 qapi/block.json      |  36 +++++++++
 5 files changed, 277 insertions(+), 1 deletion(-)
 create mode 100644 block/error_inject.c
 create mode 100644 block/error_inject.h

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 35f3bca4d9..c8fcbc276b 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -1,4 +1,4 @@
-block-obj-y += raw-format.o vmdk.o vpc.o
+block-obj-y += raw-format.o vmdk.o vpc.o error_inject.o
 block-obj-$(CONFIG_QCOW1) += qcow.o
 block-obj-$(CONFIG_VDI) += vdi.o
 block-obj-$(CONFIG_CLOOP) += cloop.o
diff --git a/block/error_inject.c b/block/error_inject.c
new file mode 100644
index 0000000000..26a47dfb8c
--- /dev/null
+++ b/block/error_inject.c
@@ -0,0 +1,179 @@
+/*
+ * Error injection code for block devices
+ *
+ * Copyright (c) 2019 Red Hat, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "error_inject.h"
+
+#include <gmodule.h>
+
+#include "qemu/thread.h"
+
+static QemuMutex error_inject_lock;
+static GHashTable *error_inject_data;
+
+
+static void delete_lba_entries(void *entry)
+{
+    GSequence *e = (GSequence *) entry;
+    g_sequence_free(e);
+}
+
+struct value {
+    uint64_t error_lba;
+
+    /*
+     * TODO Actually do something with behavior
+     */
+    MediaErrorBehavior behavior;
+    /*
+     * TODO Add data for generating bitrot, when we do change free function
+     */
+};
+
+static int key_compare(gconstpointer a, gconstpointer b, gpointer data)
+{
+    uint64_t left = ((struct value *)a)->error_lba;
+    uint64_t right = ((struct value *)b)->error_lba;
+
+    if (left < right) {
+        return -1;
+    } else if (left > right) {
+        return 1;
+    } else {
+        return 0;
+    }
+}
+
+static uint64_t error_lba_get(GSequenceIter *iter)
+{
+    gpointer tmp = g_sequence_get(iter);
+    return ((struct value *)tmp)->error_lba;
+}
+
+void media_error_create(const char *device_id, uint64_t lba,
+                        MediaErrorBehavior behavior)
+{
+    qemu_mutex_lock(&error_inject_lock);
+
+    GSequence *block_device = g_hash_table_lookup(error_inject_data, device_id);
+    if (!block_device) {
+        block_device = g_sequence_new(g_free);
+        char *key = strdup(device_id);
+        g_hash_table_insert(error_inject_data, key, block_device);
+    }
+
+    struct value lookup = {lba, MEDIA_ERROR_BEHAVIOR__MAX};
+    if (!g_sequence_lookup(block_device, &lookup, key_compare, NULL)) {
+        struct value *val = g_new(struct value, 1);
+        val->error_lba = lba;
+        val->behavior = behavior;
+
+        g_sequence_insert_sorted(block_device, val, key_compare, NULL);
+    }
+
+    qemu_mutex_unlock(&error_inject_lock);
+}
+
+void media_error_delete(const char *device_id, uint64_t lba)
+{
+    qemu_mutex_lock(&error_inject_lock);
+
+    GSequence *block_device = g_hash_table_lookup(error_inject_data, device_id);
+    if (block_device) {
+        struct value find = { lba, MEDIA_ERROR_BEHAVIOR__MAX};
+        GSequenceIter *found = g_sequence_lookup(block_device, &find,
+                                                 key_compare, NULL);
+        if (found) {
+            g_sequence_remove(found);
+        }
+    }
+
+    qemu_mutex_unlock(&error_inject_lock);
+}
+
+int error_in_read(const char *device_id, uint64_t lba, uint64_t len,
+                  uint64_t *error_lba)
+{
+    uint64_t error_sector = 0;
+    const uint64_t transfer_end = lba + len;
+    int ec = 0;
+    *error_lba = 0xFFFFFFFFFFFFFFFF;
+
+    qemu_mutex_lock(&error_inject_lock);
+
+    GSequence *block_device = g_hash_table_lookup(error_inject_data, device_id);
+    if (block_device && g_sequence_get_length(block_device) != 0) {
+        struct value find = {lba, MEDIA_ERROR_BEHAVIOR__MAX};
+        GSequenceIter *iter = g_sequence_search(block_device, &find,
+                                                key_compare, NULL);
+
+        /*
+         * g_sequence_seach returns where the item would be inserted.
+         * In the case of a direct match, it's spot is inserted after the
+         * existing, thus we need to check the one immediately before
+         * the insertion point to see if it is the one we are looking for.
+         */
+        GSequenceIter *prev = g_sequence_iter_prev(iter);
+        error_sector = error_lba_get(prev);
+
+        if (error_sector >= lba && error_sector < transfer_end) {
+            *error_lba = error_sector;
+            ec = 1;
+        } else {
+            /*
+             * Lets look at next until we find one in our transfer or bail
+             * if the error(s) logical block address are greater than the
+             * end of our transfer.
+             */
+            while (!g_sequence_iter_is_end(iter)) {
+                error_sector = error_lba_get(iter);
+
+                if (error_sector >= transfer_end) {
+                    break;
+                }
+                if (error_sector >= lba && error_sector < transfer_end) {
+                    *error_lba = error_sector;
+                    ec = 1;
+                    break;
+                } else {
+                    iter = g_sequence_iter_next(iter);
+                }
+            }
+        }
+    }
+
+    qemu_mutex_unlock(&error_inject_lock);
+
+    return ec;
+}
+
+
+static void __attribute__((__constructor__)) error_inject_init(void)
+{
+    qemu_mutex_init(&error_inject_lock);
+
+    error_inject_data = g_hash_table_new_full(g_str_hash,
+                                              g_str_equal,
+                                              g_free,
+                                              delete_lba_entries);
+}
diff --git a/block/error_inject.h b/block/error_inject.h
new file mode 100644
index 0000000000..5f2d143c90
--- /dev/null
+++ b/block/error_inject.h
@@ -0,0 +1,43 @@
+/*
+ * Error injection code for block devices
+ *
+ * Copyright (c) 2019 Red Hat, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#ifndef ERROR_INJECT
+#define ERROR_INJECT
+#include "qemu/osdep.h"
+
+#include <stdint.h>
+
+#include "qapi/qapi-commands-block.h"
+#include "qapi/qapi-types-block.h"
+
+
+void media_error_create(const char *device_id, uint64_t lba,
+                        MediaErrorBehavior behavior);
+void media_error_delete(const char *device_id, uint64_t lba);
+
+
+int error_in_read(const char *device_id, uint64_t lba, uint64_t len,
+                  uint64_t *error_lba);
+
+#endif
diff --git a/block/qapi.c b/block/qapi.c
index 15f1030264..d66201a831 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -26,11 +26,14 @@
 #include "qemu-common.h"
 #include "block/qapi.h"
 #include "block/block_int.h"
+#include "block/error_inject.h"
 #include "block/throttle-groups.h"
 #include "block/write-threshold.h"
 #include "qapi/error.h"
+#include "qapi/qapi-commands-block.h"
 #include "qapi/qapi-commands-block-core.h"
 #include "qapi/qobject-output-visitor.h"
+#include "qapi/qapi-types-block.h"
 #include "qapi/qapi-visit-block-core.h"
 #include "qapi/qmp/qbool.h"
 #include "qapi/qmp/qdict.h"
@@ -841,3 +844,18 @@ void bdrv_image_info_dump(ImageInfo *info)
         bdrv_image_info_specific_dump(info->format_specific);
     }
 }
+
+void qmp_media_error_create(const char *device_id, uint64_t lba,
+        MediaErrorBehavior behavior, Error **errp)
+{
+    /*
+     * We could validate the device_id and lba for existence and range, but we
+     * want to be able to add entries before a device is hot plugged too.
+     */
+    media_error_create(device_id, lba, behavior);
+}
+
+void qmp_media_error_delete(const char *device_id, uint64_t lba, Error **errp)
+{
+    media_error_delete(device_id, lba);
+}
diff --git a/qapi/block.json b/qapi/block.json
index 145c268bb6..2b11954b44 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -453,3 +453,39 @@
 { 'event': 'QUORUM_REPORT_BAD',
   'data': { 'type': 'QuorumOpType', '*error': 'str', 'node-name': 'str',
             'sector-num': 'int', 'sectors-count': 'int' } }
+
+##
+# @MediaErrorBehavior:
+#
+# Enumerated type for classifying type of read error behavior.
+#
+##
+{ 'enum': 'MediaErrorBehavior', 'data': ['hard', 'fixed_on_write', 'flakey'] }
+
+##
+# @media-error-create:
+#
+# Example:
+# -> {"execute": "media-error-create",
+#     "arguments": {"device_id": "12345678, "lba" : 10000, "behavior" : "hard"}}
+# <- { "return": {} }
+#
+# Create a synthetic media error for the specified logical block address when
+# it is read by the guest OS.
+#
+##
+{ 'command':'media-error-create',
+  'data': {'device_id': 'str', 'lba': 'uint64', 'behavior': 'MediaErrorBehavior'}}
+
+##
+# @media-error-delete:
+#
+# Delete a synthetic media error for the specified logical block address.
+#
+# # Example:
+# -> {"execute": "media-error-delete",
+#     "arguments": {"device_id": "01020304", "lba" : 10000}
+# <- { "return": {} }
+##
+{ 'command':'media-error-delete',
+  'data': {'device_id': 'str', 'lba': 'uint64'}}
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC 2/4] SCSI media error reporting
  2019-09-19 19:48 [RFC 0/4] POC: Generating realistic block errors Tony Asleson
  2019-09-19 19:48 ` [RFC 1/4] Add qapi for block error injection Tony Asleson
@ 2019-09-19 19:48 ` Tony Asleson
  2019-09-19 19:48 ` [RFC 3/4] NVMe " Tony Asleson
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Tony Asleson @ 2019-09-19 19:48 UTC (permalink / raw)
  To: qemu-devel, kwolf

Add abilility to return media errors for SCSI block devices.  Needs
improvement.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
---
 hw/scsi/scsi-disk.c  | 33 +++++++++++++++++++++++++++++++++
 include/scsi/utils.h |  4 ++++
 scsi/utils.c         | 31 +++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 915641a0f1..a7f3274cdf 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -37,6 +37,7 @@
 #include "sysemu/dma.h"
 #include "qemu/cutils.h"
 #include "trace.h"
+#include "block/error_inject.h"
 
 #ifdef __linux
 #include <scsi/sg.h>
@@ -132,6 +133,18 @@ static void scsi_check_condition(SCSIDiskReq *r, SCSISense sense)
     scsi_req_complete(&r->req, CHECK_CONDITION);
 }
 
+/* Helper function for SCSI media error  */
+static void scsi_media_error(SCSIDiskReq *r, SCSISense sense, uint32_t lba)
+{
+    trace_scsi_disk_check_condition(r->req.tag, sense.key, sense.asc,
+                                    sense.ascq);
+
+    r->req.sense_len = scsi_build_sense_buf_info(r->req.sense,
+                                                  SCSI_SENSE_LEN, sense, lba,
+                                                  0x80, 32);
+    scsi_req_complete(&r->req, CHECK_CONDITION);
+}
+
 static void scsi_init_iovec(SCSIDiskReq *r, size_t size)
 {
     SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
@@ -2170,6 +2183,26 @@ static int32_t scsi_disk_dma_command(SCSIRequest *req, uint8_t *buf)
         }
         r->sector = r->req.cmd.lba * (s->qdev.blocksize / 512);
         r->sector_count = len * (s->qdev.blocksize / 512);
+
+        /*
+         * TODO Move this check to a more appropriate spot.  Additionally we
+         * also need to transfer the data that was read up before encountering
+         * the media error.
+         */
+        uint64_t error_sector = 0;
+        char device_id[32];
+        sprintf(device_id, "%lu", s->qdev.wwn);
+        if (error_in_read(device_id, r->sector, r->sector_count,
+                          &error_sector)) {
+            /*
+             * TODO Fix error reporting for disks > 2TiB
+             */
+            if (error_sector > 0xFFFFFFFF) {
+                error_sector = 0xFFFFFFFF;
+            }
+            scsi_media_error(r, SENSE_CODE(READ_ERROR), (uint32_t)error_sector);
+            return 0;
+        }
         break;
     case WRITE_6:
     case WRITE_10:
diff --git a/include/scsi/utils.h b/include/scsi/utils.h
index fbc5588279..02ae74287f 100644
--- a/include/scsi/utils.h
+++ b/include/scsi/utils.h
@@ -35,6 +35,10 @@ SCSISense scsi_parse_sense_buf(const uint8_t *in_buf, int in_len);
 int scsi_build_sense_buf(uint8_t *buf, size_t max_size, SCSISense sense,
                          bool fixed_sense);
 
+int scsi_build_sense_buf_info(uint8_t *out_buf, size_t size, SCSISense sense,
+                              uint32_t information, uint8_t sksv,
+                              uint16_t sense_key_specific_info);
+
 /*
  * Predefined sense codes
  */
diff --git a/scsi/utils.c b/scsi/utils.c
index c50e81fdb8..9c9f1735d0 100644
--- a/scsi/utils.c
+++ b/scsi/utils.c
@@ -147,6 +147,37 @@ int scsi_build_sense_buf(uint8_t *out_buf, size_t size, SCSISense sense,
     return len;
 }
 
+int scsi_build_sense_buf_info(uint8_t *out_buf, size_t size, SCSISense sense,
+                                uint32_t information, uint8_t sksv,
+                                uint16_t sense_key_specific_info)
+{
+    uint8_t buf[SCSI_SENSE_LEN] = { 0 };
+
+    buf[0] = 0xF0;  /* Valid bit set. */
+    buf[2] = sense.key;
+
+    /*
+     * Set bytes 3, 4, 5, 6 value of information field
+     */
+    *((uint32_t *)(&buf[3])) = cpu_to_be32(information);
+
+    buf[7] = 10;
+    buf[12] = sense.asc;
+    buf[13] = sense.ascq;
+
+    if (sksv) {
+        buf[15] = sksv;
+        /*
+         * Set bytes 16, 17 to Sense-key specific bytes
+         */
+        *((uint16_t *)&buf[16]) = cpu_to_be16(sense_key_specific_info);
+    }
+
+    int len = MIN(SCSI_SENSE_LEN, size);
+    memcpy(out_buf, buf, len);
+    return len;
+}
+
 int scsi_build_sense(uint8_t *buf, SCSISense sense)
 {
     return scsi_build_sense_buf(buf, SCSI_SENSE_LEN, sense, true);
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC 3/4] NVMe media error reporting
  2019-09-19 19:48 [RFC 0/4] POC: Generating realistic block errors Tony Asleson
  2019-09-19 19:48 ` [RFC 1/4] Add qapi for block error injection Tony Asleson
  2019-09-19 19:48 ` [RFC 2/4] SCSI media error reporting Tony Asleson
@ 2019-09-19 19:48 ` Tony Asleson
  2019-09-19 19:48 ` [RFC 4/4] ahci " Tony Asleson
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Tony Asleson @ 2019-09-19 19:48 UTC (permalink / raw)
  To: qemu-devel, kwolf

Rudimentary and basic support for returning NVMe errors.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
---
 hw/block/nvme.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 12d8254250..faf72c2b8c 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -25,6 +25,7 @@
  * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
  */
 
+#include "block/error_inject.h"
 #include "qemu/osdep.h"
 #include "qemu/units.h"
 #include "hw/block/block.h"
@@ -390,6 +391,13 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
         return NVME_LBA_RANGE | NVME_DNR;
     }
 
+    if (!is_write) {
+        uint64_t error_sector = 0;
+        if (error_in_read(n->serial, slba, nlb, &error_sector)) {
+            return NVME_UNRECOVERED_READ | NVME_DNR;
+        }
+    }
+
     if (nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, data_size, n)) {
         block_acct_invalid(blk_get_stats(n->conf.blk), acct);
         return NVME_INVALID_FIELD | NVME_DNR;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC 4/4] ahci media error reporting
  2019-09-19 19:48 [RFC 0/4] POC: Generating realistic block errors Tony Asleson
                   ` (2 preceding siblings ...)
  2019-09-19 19:48 ` [RFC 3/4] NVMe " Tony Asleson
@ 2019-09-19 19:48 ` Tony Asleson
  2019-09-19 20:43   ` John Snow
  2019-09-20  8:36 ` [RFC 0/4] POC: Generating realistic block errors Kevin Wolf
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Tony Asleson @ 2019-09-19 19:48 UTC (permalink / raw)
  To: qemu-devel, kwolf

Initial attempt at returning a media error for ahci.  This is certainly
wrong and needs serious improvement.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
---
 hw/ide/ahci.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index d45393c019..f487764106 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -36,6 +36,7 @@
 #include "hw/ide/internal.h"
 #include "hw/ide/pci.h"
 #include "ahci_internal.h"
+#include "block/error_inject.h"
 
 #include "trace.h"
 
@@ -999,6 +1000,22 @@ static void ncq_err(NCQTransferState *ncq_tfs)
     ncq_tfs->used = 0;
 }
 
+/*
+ * Figure out correct way to report media error, this is at best a guess
+ * and based on the output of linux kernel, not even remotely close.
+ */
+static void ncq_media_err(NCQTransferState *ncq_tfs, uint64_t err_sector)
+{
+    IDEState *ide_state = &ncq_tfs->drive->port.ifs[0];
+
+    ide_state->error = ECC_ERR;
+    ide_state->status = READY_STAT | ERR_STAT;
+    ncq_tfs->drive->port_regs.scr_err |= (1 << ncq_tfs->tag);
+    ncq_tfs->lba = err_sector;
+    qemu_sglist_destroy(&ncq_tfs->sglist);
+    ncq_tfs->used = 0;
+}
+
 static void ncq_finish(NCQTransferState *ncq_tfs)
 {
     /* If we didn't error out, set our finished bit. Errored commands
@@ -1065,6 +1082,8 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
 {
     AHCIDevice *ad = ncq_tfs->drive;
     IDEState *ide_state = &ad->port.ifs[0];
+    uint64_t error_sector = 0;
+    char device_id[32];
     int port = ad->port_no;
 
     g_assert(is_ncq(ncq_tfs->cmd));
@@ -1072,6 +1091,14 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
 
     switch (ncq_tfs->cmd) {
     case READ_FPDMA_QUEUED:
+        sprintf(device_id, "%lu", ide_state->wwn);
+
+        if (error_in_read(device_id, ncq_tfs->lba,
+                ncq_tfs->sector_count, &error_sector)) {
+            ncq_media_err(ncq_tfs, error_sector);
+            return;
+        }
+
         trace_execute_ncq_command_read(ad->hba, port, ncq_tfs->tag,
                                        ncq_tfs->sector_count, ncq_tfs->lba);
         dma_acct_start(ide_state->blk, &ncq_tfs->acct,
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC 4/4] ahci media error reporting
  2019-09-19 19:48 ` [RFC 4/4] ahci " Tony Asleson
@ 2019-09-19 20:43   ` John Snow
  2019-09-19 21:49     ` Tony Asleson
  2019-09-20  8:43     ` Kevin Wolf
  0 siblings, 2 replies; 30+ messages in thread
From: John Snow @ 2019-09-19 20:43 UTC (permalink / raw)
  To: Tony Asleson; +Cc: Kevin Wolf, qemu-devel, Qemu-block



On 9/19/19 3:48 PM, Tony Asleson wrote:
> Initial attempt at returning a media error for ahci.  This is certainly
> wrong and needs serious improvement.
> 

Hi; I have the unfortunate distinction of being the AHCI maintainer.
Please CC me on future revisions and discussion if you are interacting
with my problem child.

Also remember to CC qemu-block.

> Signed-off-by: Tony Asleson <tasleson@redhat.com>
> ---
>  hw/ide/ahci.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
> index d45393c019..f487764106 100644
> --- a/hw/ide/ahci.c
> +++ b/hw/ide/ahci.c
> @@ -36,6 +36,7 @@
>  #include "hw/ide/internal.h"
>  #include "hw/ide/pci.h"
>  #include "ahci_internal.h"
> +#include "block/error_inject.h"
>  
>  #include "trace.h"
>  
> @@ -999,6 +1000,22 @@ static void ncq_err(NCQTransferState *ncq_tfs)
>      ncq_tfs->used = 0;
>  }
>  
> +/*
> + * Figure out correct way to report media error, this is at best a guess
> + * and based on the output of linux kernel, not even remotely close.
> + */

Depends on what kind of error, exactly, you're trying to report, and at
what level. (AHCI, NCQ, SATA, ATA?)

Keep in mind that you've inserted an error path for SATA drives using
NCQ, but seemingly haven't touched SATA drives not using NCQ, or ATAPI
devices using either PATA/SATA, or ATA drives on PATA.

> +static void ncq_media_err(NCQTransferState *ncq_tfs, uint64_t err_sector)
> +{
> +    IDEState *ide_state = &ncq_tfs->drive->port.ifs[0];
> +
> +    ide_state->error = ECC_ERR;
> +    ide_state->status = READY_STAT | ERR_STAT;
> +    ncq_tfs->drive->port_regs.scr_err |= (1 << ncq_tfs->tag);
> +    ncq_tfs->lba = err_sector;
> +    qemu_sglist_destroy(&ncq_tfs->sglist);
> +    ncq_tfs->used = 0;
> +}
> +

If you are definitely very sure you only want an ide_state error
difference, you could just as well call ncq_err and then patch
ide_state->error.

(I am not sure that's what you want, but... maybe it is?)

I'd have to check -- because I can't say the AHCI emulator was designed
so much as congealed -- but you might need calls to ncq_finish.

usually, ncq_cb handles the return from any NCQ command and will call
ncq_err and ncq_finish as appropriate to tidy up the command.

It might be a mistake that execute_ncq_command issues ncq_err in the
`default` arm of the switch statement without a call to finish.

If we do call ncq_finish from this context I'm not sure if we want
block_acct_done here unconditionally. We may not have started a block
accounting operation if we never started a backend operation. Everything
else looks about right to me.


>  static void ncq_finish(NCQTransferState *ncq_tfs)
>  {
>      /* If we didn't error out, set our finished bit. Errored commands
> @@ -1065,6 +1082,8 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>  {
>      AHCIDevice *ad = ncq_tfs->drive;
>      IDEState *ide_state = &ad->port.ifs[0];
> +    uint64_t error_sector = 0;
> +    char device_id[32];
>      int port = ad->port_no;
>  
>      g_assert(is_ncq(ncq_tfs->cmd));
> @@ -1072,6 +1091,14 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>  
>      switch (ncq_tfs->cmd) {
>      case READ_FPDMA_QUEUED:
> +        sprintf(device_id, "%lu", ide_state->wwn);

This seems suspicious for your design in general, but I'd like to not
run sprintf to a buffer in the hotpath for NCQ.

If you need this, I'd do it when wwn is set and just grab it from the
state structure.

> +
> +        if (error_in_read(device_id, ncq_tfs->lba,
> +                ncq_tfs->sector_count, &error_sector)) {
> +            ncq_media_err(ncq_tfs, error_sector);
> +            return;
> +        }
> +

One of the downsides to trying to trigger read error injections
per-device instead of per-node is that now you have to goof around with
device-specific code everywhere.

I suppose from your cover letter you *WANT* device-specific error
exercising, which would necessitate a different design from blkdebug,
but it means you have to add support for the framework per-device and it
might be very tricky to get right.

>          trace_execute_ncq_command_read(ad->hba, port, ncq_tfs->tag,
>                                         ncq_tfs->sector_count, ncq_tfs->lba);
>          dma_acct_start(ide_state->blk, &ncq_tfs->acct,
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 4/4] ahci media error reporting
  2019-09-19 20:43   ` John Snow
@ 2019-09-19 21:49     ` Tony Asleson
  2019-09-20 17:22       ` John Snow
  2019-09-20  8:43     ` Kevin Wolf
  1 sibling, 1 reply; 30+ messages in thread
From: Tony Asleson @ 2019-09-19 21:49 UTC (permalink / raw)
  To: John Snow; +Cc: Kevin Wolf, qemu-devel, Qemu-block

On 9/19/19 3:43 PM, John Snow wrote:
> 
> 
> On 9/19/19 3:48 PM, Tony Asleson wrote:
>> Initial attempt at returning a media error for ahci.  This is certainly
>> wrong and needs serious improvement.
>>
> 
> Hi; I have the unfortunate distinction of being the AHCI maintainer.
> Please CC me on future revisions and discussion if you are interacting
> with my problem child.

Will do and thank you for taking a look at this!

> Also remember to CC qemu-block.
> 
>> Signed-off-by: Tony Asleson <tasleson@redhat.com>
>> ---
>>  hw/ide/ahci.c | 27 +++++++++++++++++++++++++++
>>  1 file changed, 27 insertions(+)
>>
>> diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
>> index d45393c019..f487764106 100644
>> --- a/hw/ide/ahci.c
>> +++ b/hw/ide/ahci.c
>> @@ -36,6 +36,7 @@
>>  #include "hw/ide/internal.h"
>>  #include "hw/ide/pci.h"
>>  #include "ahci_internal.h"
>> +#include "block/error_inject.h"
>>  
>>  #include "trace.h"
>>  
>> @@ -999,6 +1000,22 @@ static void ncq_err(NCQTransferState *ncq_tfs)
>>      ncq_tfs->used = 0;
>>  }
>>  
>> +/*
>> + * Figure out correct way to report media error, this is at best a guess
>> + * and based on the output of linux kernel, not even remotely close.
>> + */
> 
> Depends on what kind of error, exactly, you're trying to report, and at
> what level. (AHCI, NCQ, SATA, ATA?)

I was trying to return a media error, like a 3/1101 for SCSI device.  I
think that is at the ATA level?


> Keep in mind that you've inserted an error path for SATA drives using
> NCQ, but seemingly haven't touched SATA drives not using NCQ, or ATAPI
> devices using either PATA/SATA, or ATA drives on PATA.

Correct, but for trying out a simple read on a SATA drive for Linux I
end up here first :-)  Well, until the kernel retries a number of times
and finally gives up and returns an error while also disabling NCQ for
the device.


>> +static void ncq_media_err(NCQTransferState *ncq_tfs, uint64_t err_sector)
>> +{
>> +    IDEState *ide_state = &ncq_tfs->drive->port.ifs[0];
>> +
>> +    ide_state->error = ECC_ERR;
>> +    ide_state->status = READY_STAT | ERR_STAT;
>> +    ncq_tfs->drive->port_regs.scr_err |= (1 << ncq_tfs->tag);
>> +    ncq_tfs->lba = err_sector;
>> +    qemu_sglist_destroy(&ncq_tfs->sglist);
>> +    ncq_tfs->used = 0;
>> +}
>> +
> 
> If you are definitely very sure you only want an ide_state error
> difference, you could just as well call ncq_err and then patch
> ide_state->error.
> 
> (I am not sure that's what you want, but... maybe it is?)

As I mentioned above, return an unrecoverable media error.

SCSI sense data can report the first sector in error and I thought I
could do the same for SATA too?

> I'd have to check -- because I can't say the AHCI emulator was designed
> so much as congealed -- but you might need calls to ncq_finish.
> 
> usually, ncq_cb handles the return from any NCQ command and will call
> ncq_err and ncq_finish as appropriate to tidy up the command.
> 
> It might be a mistake that execute_ncq_command issues ncq_err in the
> `default` arm of the switch statement without a call to finish.
> 
> If we do call ncq_finish from this context I'm not sure if we want
> block_acct_done here unconditionally. We may not have started a block
> accounting operation if we never started a backend operation. Everything
> else looks about right to me.
> 
> 
>>  static void ncq_finish(NCQTransferState *ncq_tfs)
>>  {
>>      /* If we didn't error out, set our finished bit. Errored commands
>> @@ -1065,6 +1082,8 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>>  {
>>      AHCIDevice *ad = ncq_tfs->drive;
>>      IDEState *ide_state = &ad->port.ifs[0];
>> +    uint64_t error_sector = 0;
>> +    char device_id[32];
>>      int port = ad->port_no;
>>  
>>      g_assert(is_ncq(ncq_tfs->cmd));
>> @@ -1072,6 +1091,14 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>>  
>>      switch (ncq_tfs->cmd) {
>>      case READ_FPDMA_QUEUED:
>> +        sprintf(device_id, "%lu", ide_state->wwn);
> 
> This seems suspicious for your design in general, but I'd like to not
> run sprintf to a buffer in the hotpath for NCQ.

I totally agree.

I started out using integers in the call for error_in_read, as that is
what SCSI uses internally for wwn.  Then I did NVMe and it's using a
string that doesn't apparently need to be an integer for the wwn? so I
changed it to being a string to accommodate.

I also find it interesting that when a SATA device wwid is dumped out
within the guest it doesn't appear to have any correlation with the wwn
that was passed in on the command line, eg.

-device ide-drive,drive=satadisk,bus=ahci.0,wwn=8675309

$cat /sys/block/sda/device/wwid
t10.ATA     QEMU HARDDISK                           QM00005


This does correlate for SCSI

-device scsi-hd,drive=hd,wwn=12345678

$ cat /sys/block/sdc/device/wwid
naa.0000000000bc614e


as 0xbc614e = 12345678


For NVMe, the wwn is in the wwid, but it's not immediately obvious.

Being able to correlate between the command line and what you find in
the guest would be good.


> If you need this, I'd do it when wwn is set and just grab it from the
> state structure.
> 
>> +
>> +        if (error_in_read(device_id, ncq_tfs->lba,
>> +                ncq_tfs->sector_count, &error_sector)) {
>> +            ncq_media_err(ncq_tfs, error_sector);
>> +            return;
>> +        }
>> +
> 
> One of the downsides to trying to trigger read error injections
> per-device instead of per-node is that now you have to goof around with
> device-specific code everywhere>
> I suppose from your cover letter you *WANT* device-specific error
> exercising, which would necessitate a different design from blkdebug,
> but it means you have to add support for the framework per-device and it
> might be very tricky to get right.

Yes, goal was to be able to selectively pick one or more specific block
devices and then create one or more block errors on each device with
potentially different error behavior for each block in error.


>>          trace_execute_ncq_command_read(ad->hba, port, ncq_tfs->tag,
>>                                         ncq_tfs->sector_count, ncq_tfs->lba);
>>          dma_acct_start(ide_state->blk, &ncq_tfs->acct,
>>



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-09-19 19:48 [RFC 0/4] POC: Generating realistic block errors Tony Asleson
                   ` (3 preceding siblings ...)
  2019-09-19 19:48 ` [RFC 4/4] ahci " Tony Asleson
@ 2019-09-20  8:36 ` Kevin Wolf
  2019-09-20 16:41   ` Tony Asleson
  2019-09-20  9:22 ` Stefan Hajnoczi
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Kevin Wolf @ 2019-09-20  8:36 UTC (permalink / raw)
  To: Tony Asleson; +Cc: qemu-devel

Am 19.09.2019 um 21:48 hat Tony Asleson geschrieben:
> For a long time I thought that VMs could be a great way to improve OS
> code quality by modifying the simulated hardware to return errors to
> exercise error paths in the OSs, specifically in block devices for right now.
> A number of different approaches are available within the Linux kernel, eg.
> scsi-debug, dm-flakey, and others.  However, I always though it would be best to
> simulate it from the hardware.  To fully exercise the entire stack.  As a
> bonus it's OS agnostic for those projects that cross OSs and it's available
> before the OS even boots.
> 
> This POC needs a lot more work, but it's what I have so far.  Learning QEMU
> internals, plus some of the different bus types has been interesting to say
> the least.  I'm most familiar with SCSI, but the others are foreign to me.
> AHCI/SATA/ATA is very interesting with it's history and the associated code to
> handle it's evolution.
> 
> Eventually I think it would be useful to add functionality for errors on
> write paths, timeouts, and different error behaviors.  Like media error(s) that
> are corrected by a re-write (simulate failed write on power loss), bit rot
> injection etc.  I know a number of these can be solved different ways,
> but embracing them from the VM environment seems ideal to me.

Yes, this makes sense to me. The question to answer seems to be where
this functionality would fit in - first of all, frontend (inside the
device emulation) or backend. And it's not an easy quesiton, but see
below.

> Expanding to gather statistics on IO patterns, histograms etc. could
> be very useful too.  Having the ability to start/stop information
> collection and then have access to what happened and in what order
> could allow for better error injection of key FS structures and
> software RAID solutions too.

We do already gather some basic statistics, and it would probably make
sense to add more things there. This is pretty clearly a job for the
backend, I think. If the information is easy to collect, we can add it
in the normal I/O path, otherwise an optional filter block driver could
do this.

> I've recently been informed that blkdebug exists.  From a cursory
> investigation it does appear have overlap, but I haven't given it a
> try yet.  From looking at the code to insert my changes it appears
> that some of the errors I'm generating are different than what for
> example an EIO on a read_aio does, but I'm not sure.  Perhaps some of
> the other features that I've outlined above already exist too in some
> other capacity of QEMU?

The overlap seems big enough that we can't ignore it. blkdebug was
primarily meant to debug image format block drivers like qcow2, but
because this means that it can inject I/O errors if specific sectors are
accessed, it makes sense to just use the same functionality for guest
devices, too.

I/O error inserted by blkdebug can be one-off or permanent, but since it
also supports using a small state machine, I think you should already be
able to configure your errors that are corrected by a rewrite, too, even
if there is no explicit support for this yet (I guess we could add it if
it turned out to be much easier to use).

The one thing I see in your series that we can't currently provide this
way is the exact sector number where the error happened. If you read
from sector 32 to 64 and there is an error configured for sector 50, you
just see that the whole request is failing.


I also wonder why you had to write low-level error handling code instead
of calling the existing error functions. If the existing functions don't
give the right result in error cases, shouldn't they be fixed anyway?

And then, as John already hinted, adding code for debugging scenarios to
hot paths that are important for high-performance VMs that don't use any
debugging is less than optimal.


So bringing everything together, what would you think of this plan:

1. Extend blkdebug with whatever ways you need to trigger I/O errors
   (only if the existing modes aren't sufficient at least for the start;
   we can still always extend it later)

2. Add a new BlockDriver callback that can return detailed information
   about an error (such as the exact sector number), and wire it up
   through BlockBackend (some blk_* function). Implement it in blkdebug.

3. In the guest devices, only call the function to get detailed error
   information in the existing error path. You can then update some
   device state according to the details if the block driver returned
   anything (probably only blkdebug will return something).

This way, we have no changes at all in the hot I/O path if you didn't
configure your VM with a blkdebug filter. And we avoid duplication of
code both in the error handler in devices and in the error injection
mechanisms.

Kevin


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 4/4] ahci media error reporting
  2019-09-19 20:43   ` John Snow
  2019-09-19 21:49     ` Tony Asleson
@ 2019-09-20  8:43     ` Kevin Wolf
  2019-09-20 16:18       ` John Snow
  1 sibling, 1 reply; 30+ messages in thread
From: Kevin Wolf @ 2019-09-20  8:43 UTC (permalink / raw)
  To: John Snow; +Cc: Tony Asleson, qemu-devel, Qemu-block

Am 19.09.2019 um 22:43 hat John Snow geschrieben:
> I'd have to check -- because I can't say the AHCI emulator was designed
> so much as congealed -- but you might need calls to ncq_finish.
> 
> usually, ncq_cb handles the return from any NCQ command and will call
> ncq_err and ncq_finish as appropriate to tidy up the command.
> 
> It might be a mistake that execute_ncq_command issues ncq_err in the
> `default` arm of the switch statement without a call to finish.
> 
> If we do call ncq_finish from this context I'm not sure if we want
> block_acct_done here unconditionally. We may not have started a block
> accounting operation if we never started a backend operation. Everything
> else looks about right to me.

With that much uncertainty, the one thing I'm pretty certain of is that
someone (TM) should write some qtests - if only to figure out what
really happens.

Kevin


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 1/4] Add qapi for block error injection
  2019-09-19 19:48 ` [RFC 1/4] Add qapi for block error injection Tony Asleson
@ 2019-09-20  9:03   ` Philippe Mathieu-Daudé
  2019-09-20 15:17     ` Tony Asleson
  0 siblings, 1 reply; 30+ messages in thread
From: Philippe Mathieu-Daudé @ 2019-09-20  9:03 UTC (permalink / raw)
  To: Tony Asleson, qemu-devel, kwolf

Hi Tony,

On 9/19/19 9:48 PM, Tony Asleson wrote:
> Proof of concept code to dynamically create/delete media errors
> for block devices using the qapi interface.  This is useful for testing
> the OS and the block and FS layers for error handling.  Utilizing this
> in a VM allows it to be OS agnostic and test the OS at it's lowest levels
> of hardware interaction.
> 
> Signed-off-by: Tony Asleson <tasleson@redhat.com>
> ---
>  block/Makefile.objs  |   2 +-
>  block/error_inject.c | 179 +++++++++++++++++++++++++++++++++++++++++++
>  block/error_inject.h |  43 +++++++++++
>  block/qapi.c         |  18 +++++
>  qapi/block.json      |  36 +++++++++
>  5 files changed, 277 insertions(+), 1 deletion(-)
>  create mode 100644 block/error_inject.c
>  create mode 100644 block/error_inject.h
> 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 35f3bca4d9..c8fcbc276b 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -1,4 +1,4 @@
> -block-obj-y += raw-format.o vmdk.o vpc.o
> +block-obj-y += raw-format.o vmdk.o vpc.o error_inject.o
>  block-obj-$(CONFIG_QCOW1) += qcow.o
>  block-obj-$(CONFIG_VDI) += vdi.o
>  block-obj-$(CONFIG_CLOOP) += cloop.o
> diff --git a/block/error_inject.c b/block/error_inject.c
> new file mode 100644
> index 0000000000..26a47dfb8c
> --- /dev/null
> +++ b/block/error_inject.c
> @@ -0,0 +1,179 @@
> +/*
> + * Error injection code for block devices
> + *
> + * Copyright (c) 2019 Red Hat, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +
> +#include "error_inject.h"
> +
> +#include <gmodule.h>
> +
> +#include "qemu/thread.h"
> +
> +static QemuMutex error_inject_lock;
> +static GHashTable *error_inject_data;
> +
> +
> +static void delete_lba_entries(void *entry)
> +{
> +    GSequence *e = (GSequence *) entry;
> +    g_sequence_free(e);
> +}
> +
> +struct value {
> +    uint64_t error_lba;
> +
> +    /*
> +     * TODO Actually do something with behavior
> +     */
> +    MediaErrorBehavior behavior;
> +    /*
> +     * TODO Add data for generating bitrot, when we do change free function
> +     */
> +};
> +
> +static int key_compare(gconstpointer a, gconstpointer b, gpointer data)
> +{
> +    uint64_t left = ((struct value *)a)->error_lba;
> +    uint64_t right = ((struct value *)b)->error_lba;
> +
> +    if (left < right) {
> +        return -1;
> +    } else if (left > right) {
> +        return 1;
> +    } else {
> +        return 0;
> +    }
> +}
> +
> +static uint64_t error_lba_get(GSequenceIter *iter)
> +{
> +    gpointer tmp = g_sequence_get(iter);
> +    return ((struct value *)tmp)->error_lba;
> +}
> +
> +void media_error_create(const char *device_id, uint64_t lba,
> +                        MediaErrorBehavior behavior)
> +{
> +    qemu_mutex_lock(&error_inject_lock);
> +
> +    GSequence *block_device = g_hash_table_lookup(error_inject_data, device_id);
> +    if (!block_device) {
> +        block_device = g_sequence_new(g_free);
> +        char *key = strdup(device_id);
> +        g_hash_table_insert(error_inject_data, key, block_device);
> +    }
> +
> +    struct value lookup = {lba, MEDIA_ERROR_BEHAVIOR__MAX};
> +    if (!g_sequence_lookup(block_device, &lookup, key_compare, NULL)) {
> +        struct value *val = g_new(struct value, 1);
> +        val->error_lba = lba;
> +        val->behavior = behavior;
> +
> +        g_sequence_insert_sorted(block_device, val, key_compare, NULL);
> +    }
> +
> +    qemu_mutex_unlock(&error_inject_lock);
> +}
> +
> +void media_error_delete(const char *device_id, uint64_t lba)

Can you explain the goal of this function? I don't understand it from
the hardware PoV. Or is it a simple code cleanup function?

Once hw starts to break, it is unlikely it recovers magically to a
previous sane state...

In real world, block hardware starts to fail, controller notice
inconsistency and set some bits, the controller driver react.
Eventually the hw will still be used, with the failing block avoided.

What do you think? How is your current use case setup?

Thanks,

Phil.

> +{
> +    qemu_mutex_lock(&error_inject_lock);
> +
> +    GSequence *block_device = g_hash_table_lookup(error_inject_data, device_id);
> +    if (block_device) {
> +        struct value find = { lba, MEDIA_ERROR_BEHAVIOR__MAX};
> +        GSequenceIter *found = g_sequence_lookup(block_device, &find,
> +                                                 key_compare, NULL);
> +        if (found) {
> +            g_sequence_remove(found);
> +        }
> +    }
> +
> +    qemu_mutex_unlock(&error_inject_lock);
> +}
> +
> +int error_in_read(const char *device_id, uint64_t lba, uint64_t len,
> +                  uint64_t *error_lba)
> +{
> +    uint64_t error_sector = 0;
> +    const uint64_t transfer_end = lba + len;
> +    int ec = 0;
> +    *error_lba = 0xFFFFFFFFFFFFFFFF;
> +
> +    qemu_mutex_lock(&error_inject_lock);
> +
> +    GSequence *block_device = g_hash_table_lookup(error_inject_data, device_id);
> +    if (block_device && g_sequence_get_length(block_device) != 0) {
> +        struct value find = {lba, MEDIA_ERROR_BEHAVIOR__MAX};
> +        GSequenceIter *iter = g_sequence_search(block_device, &find,
> +                                                key_compare, NULL);
> +
> +        /*
> +         * g_sequence_seach returns where the item would be inserted.
> +         * In the case of a direct match, it's spot is inserted after the
> +         * existing, thus we need to check the one immediately before
> +         * the insertion point to see if it is the one we are looking for.
> +         */
> +        GSequenceIter *prev = g_sequence_iter_prev(iter);
> +        error_sector = error_lba_get(prev);
> +
> +        if (error_sector >= lba && error_sector < transfer_end) {
> +            *error_lba = error_sector;
> +            ec = 1;
> +        } else {
> +            /*
> +             * Lets look at next until we find one in our transfer or bail
> +             * if the error(s) logical block address are greater than the
> +             * end of our transfer.
> +             */
> +            while (!g_sequence_iter_is_end(iter)) {
> +                error_sector = error_lba_get(iter);
> +
> +                if (error_sector >= transfer_end) {
> +                    break;
> +                }
> +                if (error_sector >= lba && error_sector < transfer_end) {
> +                    *error_lba = error_sector;
> +                    ec = 1;
> +                    break;
> +                } else {
> +                    iter = g_sequence_iter_next(iter);
> +                }
> +            }
> +        }
> +    }
> +
> +    qemu_mutex_unlock(&error_inject_lock);
> +
> +    return ec;
> +}
> +
> +
> +static void __attribute__((__constructor__)) error_inject_init(void)
> +{
> +    qemu_mutex_init(&error_inject_lock);
> +
> +    error_inject_data = g_hash_table_new_full(g_str_hash,
> +                                              g_str_equal,
> +                                              g_free,
> +                                              delete_lba_entries);
> +}
> diff --git a/block/error_inject.h b/block/error_inject.h
> new file mode 100644
> index 0000000000..5f2d143c90
> --- /dev/null
> +++ b/block/error_inject.h
> @@ -0,0 +1,43 @@
> +/*
> + * Error injection code for block devices
> + *
> + * Copyright (c) 2019 Red Hat, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +
> +#ifndef ERROR_INJECT
> +#define ERROR_INJECT
> +#include "qemu/osdep.h"
> +
> +#include <stdint.h>
> +
> +#include "qapi/qapi-commands-block.h"
> +#include "qapi/qapi-types-block.h"
> +
> +
> +void media_error_create(const char *device_id, uint64_t lba,
> +                        MediaErrorBehavior behavior);
> +void media_error_delete(const char *device_id, uint64_t lba);
> +
> +
> +int error_in_read(const char *device_id, uint64_t lba, uint64_t len,
> +                  uint64_t *error_lba);
> +
> +#endif
> diff --git a/block/qapi.c b/block/qapi.c
> index 15f1030264..d66201a831 100644
> --- a/block/qapi.c
> +++ b/block/qapi.c
> @@ -26,11 +26,14 @@
>  #include "qemu-common.h"
>  #include "block/qapi.h"
>  #include "block/block_int.h"
> +#include "block/error_inject.h"
>  #include "block/throttle-groups.h"
>  #include "block/write-threshold.h"
>  #include "qapi/error.h"
> +#include "qapi/qapi-commands-block.h"
>  #include "qapi/qapi-commands-block-core.h"
>  #include "qapi/qobject-output-visitor.h"
> +#include "qapi/qapi-types-block.h"
>  #include "qapi/qapi-visit-block-core.h"
>  #include "qapi/qmp/qbool.h"
>  #include "qapi/qmp/qdict.h"
> @@ -841,3 +844,18 @@ void bdrv_image_info_dump(ImageInfo *info)
>          bdrv_image_info_specific_dump(info->format_specific);
>      }
>  }
> +
> +void qmp_media_error_create(const char *device_id, uint64_t lba,
> +        MediaErrorBehavior behavior, Error **errp)
> +{
> +    /*
> +     * We could validate the device_id and lba for existence and range, but we
> +     * want to be able to add entries before a device is hot plugged too.
> +     */
> +    media_error_create(device_id, lba, behavior);
> +}
> +
> +void qmp_media_error_delete(const char *device_id, uint64_t lba, Error **errp)
> +{
> +    media_error_delete(device_id, lba);
> +}
> diff --git a/qapi/block.json b/qapi/block.json
> index 145c268bb6..2b11954b44 100644
> --- a/qapi/block.json
> +++ b/qapi/block.json
> @@ -453,3 +453,39 @@
>  { 'event': 'QUORUM_REPORT_BAD',
>    'data': { 'type': 'QuorumOpType', '*error': 'str', 'node-name': 'str',
>              'sector-num': 'int', 'sectors-count': 'int' } }
> +
> +##
> +# @MediaErrorBehavior:
> +#
> +# Enumerated type for classifying type of read error behavior.
> +#
> +##
> +{ 'enum': 'MediaErrorBehavior', 'data': ['hard', 'fixed_on_write', 'flakey'] }
> +
> +##
> +# @media-error-create:
> +#
> +# Example:
> +# -> {"execute": "media-error-create",
> +#     "arguments": {"device_id": "12345678, "lba" : 10000, "behavior" : "hard"}}
> +# <- { "return": {} }
> +#
> +# Create a synthetic media error for the specified logical block address when
> +# it is read by the guest OS.
> +#
> +##
> +{ 'command':'media-error-create',
> +  'data': {'device_id': 'str', 'lba': 'uint64', 'behavior': 'MediaErrorBehavior'}}
> +
> +##
> +# @media-error-delete:
> +#
> +# Delete a synthetic media error for the specified logical block address.
> +#
> +# # Example:
> +# -> {"execute": "media-error-delete",
> +#     "arguments": {"device_id": "01020304", "lba" : 10000}
> +# <- { "return": {} }
> +##
> +{ 'command':'media-error-delete',
> +  'data': {'device_id': 'str', 'lba': 'uint64'}}
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-09-19 19:48 [RFC 0/4] POC: Generating realistic block errors Tony Asleson
                   ` (4 preceding siblings ...)
  2019-09-20  8:36 ` [RFC 0/4] POC: Generating realistic block errors Kevin Wolf
@ 2019-09-20  9:22 ` Stefan Hajnoczi
  2019-09-20 17:28   ` Tony Asleson
  2019-09-20  9:25 ` Stefan Hajnoczi
  2019-09-20 14:41 ` no-reply
  7 siblings, 1 reply; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-09-20  9:22 UTC (permalink / raw)
  To: Tony Asleson; +Cc: kwolf, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2898 bytes --]

On Thu, Sep 19, 2019 at 02:48:43PM -0500, Tony Asleson wrote:
> For a long time I thought that VMs could be a great way to improve OS
> code quality by modifying the simulated hardware to return errors to
> exercise error paths in the OSs, specifically in block devices for right now.
> A number of different approaches are available within the Linux kernel, eg.
> scsi-debug, dm-flakey, and others.  However, I always though it would be best to
> simulate it from the hardware.  To fully exercise the entire stack.  As a
> bonus it's OS agnostic for those projects that cross OSs and it's available
> before the OS even boots.
> 
> This POC needs a lot more work, but it's what I have so far.  Learning QEMU
> internals, plus some of the different bus types has been interesting to say
> the least.  I'm most familiar with SCSI, but the others are foreign to me.
> AHCI/SATA/ATA is very interesting with it's history and the associated code to
> handle it's evolution.
> 
> Eventually I think it would be useful to add functionality for errors on
> write paths, timeouts, and different error behaviors.  Like media error(s) that
> are corrected by a re-write (simulate failed write on power loss), bit rot
> injection etc.  I know a number of these can be solved different ways,
> but embracing them from the VM environment seems ideal to me.  Expanding
> to gather statistics on IO patterns, histograms etc. could be very
> useful too.  Having the ability to start/stop information collection and
> then have access to what happened and in what order could allow for
> better error injection of key FS structures and software RAID solutions too.
> 
> I've recently been informed that blkdebug exists.  From a cursory investigation
> it does appear have overlap, but I haven't given it a try yet.  From looking
> at the code to insert my changes it appears that some of the errors I'm
> generating are different than what for example an EIO on a read_aio does, but
> I'm not sure.  Perhaps some of the other features that I've outlined above
> already exist too in some other capacity of QEMU?

blkdebug is purely at the QEMU block layer level.  It is not aware of
storage controller-specific error information or features.  If you want
to inject NVMe- or SCSI-specific errors that make no sense in QEMU's
block layer, then trying to do it in blkdebug becomes a layering
violation.  This justifies adding a new error injection feature directly
into AHCI, virtio-scsi, NVMe, etc devices.

What I like about blkdebug is the state machine and relatively powerful
tests that this enables.  It makes sense to reuse those for storage
controller-specific error injection too.  In the future we may wish to
reuse it for network cards and other emulated devices too!

Perhaps the error injection "engine" (the core blkdebug code) can be
extracted and reused?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-09-19 19:48 [RFC 0/4] POC: Generating realistic block errors Tony Asleson
                   ` (5 preceding siblings ...)
  2019-09-20  9:22 ` Stefan Hajnoczi
@ 2019-09-20  9:25 ` Stefan Hajnoczi
  2019-09-20 14:41 ` no-reply
  7 siblings, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-09-20  9:25 UTC (permalink / raw)
  To: Tony Asleson; +Cc: kwolf, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 502 bytes --]

On Thu, Sep 19, 2019 at 02:48:43PM -0500, Tony Asleson wrote:
> I've recently been informed that blkdebug exists.

By the way, if error injection only needs to report media errors, then
the existing blkdebug feature should be sufficient to do this.

I think adding storage controller-specific error injection is only
really necessary for injection SCSI-, AHCI-, or NVMe-specific errors
that the QEMU block layer has no concept of.

So the question is: can blkdebug already do what you need? :)

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-09-19 19:48 [RFC 0/4] POC: Generating realistic block errors Tony Asleson
                   ` (6 preceding siblings ...)
  2019-09-20  9:25 ` Stefan Hajnoczi
@ 2019-09-20 14:41 ` no-reply
  7 siblings, 0 replies; 30+ messages in thread
From: no-reply @ 2019-09-20 14:41 UTC (permalink / raw)
  To: tasleson; +Cc: kwolf, qemu-devel

Patchew URL: https://patchew.org/QEMU/20190919194847.18518-1-tasleson@redhat.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC      hw/misc/arm_sysctl.o
  CC      hw/misc/cbus.o
/tmp/qemu-test/src/hw/ide/ahci.c: In function 'execute_ncq_command':
/tmp/qemu-test/src/hw/ide/ahci.c:1094:31: error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'uint64_t' {aka 'long long unsigned int'} [-Werror=format=]
         sprintf(device_id, "%lu", ide_state->wwn);
                             ~~^   ~~~~~~~~~~~~~~
                             %llu


The full log is available at
http://patchew.org/logs/20190919194847.18518-1-tasleson@redhat.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 1/4] Add qapi for block error injection
  2019-09-20  9:03   ` Philippe Mathieu-Daudé
@ 2019-09-20 15:17     ` Tony Asleson
  0 siblings, 0 replies; 30+ messages in thread
From: Tony Asleson @ 2019-09-20 15:17 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel, kwolf

On 9/20/19 4:03 AM, Philippe Mathieu-Daudé wrote:
> Hi Tony,

Hi Phil

> 
> On 9/19/19 9:48 PM, Tony Asleson wrote:
...
>> +void media_error_delete(const char *device_id, uint64_t lba)
> 
> Can you explain the goal of this function? I don't understand it from
> the hardware PoV. Or is it a simple code cleanup function?

It's just a way to do clean up.  I actually used it during development
as the Linux kernel can get very unresponsive when doing error retries ,
so I removed the error to allow things to proceed again.

> Once hw starts to break, it is unlikely it recovers magically to a
> previous sane state...
> 
> In real world, block hardware starts to fail, controller notice
> inconsistency and set some bits, the controller driver react.
> Eventually the hw will still be used, with the failing block avoided.

This is probably the most common case that people are familiar with, a
sector that goes bad and persists.

In rotating magnetic media devices, you can have a media error if a
write was interrupted by a power failure.  The sector has part of the
new write and part of the old write, CRC check fails and ECC is unable
to correct.  This can be corrected by having the host re-write the sector.

You can also have an intermittent read error if you have a transient
media problem, eg. a small particulate is moving around on the disk
surface (ransient thermal asperity, instead of persistent).

Depending on disk type (SCSI/SATA) and firmware features, the disk may
auto-reassign the block when it discovers that it had to go through lots
of recovery to retrieve the data or it may require the initiator to
issue a reassign blocks command.  In both cases the lba replacement is
ready to use.

> What do you think? How is your current use case setup?

Eventually I think it would be good to model some or all of these other
error scenarios I've outlined above.

> 
> Thanks,

Thank you!

-Tony


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 4/4] ahci media error reporting
  2019-09-20  8:43     ` Kevin Wolf
@ 2019-09-20 16:18       ` John Snow
  2019-09-20 19:25         ` Tony Asleson
  0 siblings, 1 reply; 30+ messages in thread
From: John Snow @ 2019-09-20 16:18 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Tony Asleson, qemu-devel, Qemu-block



On 9/20/19 4:43 AM, Kevin Wolf wrote:
> Am 19.09.2019 um 22:43 hat John Snow geschrieben:
>> I'd have to check -- because I can't say the AHCI emulator was designed
>> so much as congealed -- but you might need calls to ncq_finish.
>>
>> usually, ncq_cb handles the return from any NCQ command and will call
>> ncq_err and ncq_finish as appropriate to tidy up the command.
>>
>> It might be a mistake that execute_ncq_command issues ncq_err in the
>> `default` arm of the switch statement without a call to finish.
>>
>> If we do call ncq_finish from this context I'm not sure if we want
>> block_acct_done here unconditionally. We may not have started a block
>> accounting operation if we never started a backend operation. Everything
>> else looks about right to me.
> 
> With that much uncertainty, the one thing I'm pretty certain of is that
> someone (TM) should write some qtests - if only to figure out what
> really happens.
> 

For sure -- I handle the normative cases, but I don't test what happens
if you issue an unsupported NCQ command. (I don't know what real
hardware does right now, either. I'm sure I could read the spec and find
out, but don't have a testing setup that lets me analyze real hardware
anymore.)

I will have to defer this to someone (TM), but I suspect the code (that
I suspect was used as a basis for inserting a new error pathway in this
patch) is wrong and is missing a call to ncq_finish -- but that call
needs to not call the block accounting cleanup, because we didn't start
an operation in this case.

That's my Official Hunch.

--js


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-09-20  8:36 ` [RFC 0/4] POC: Generating realistic block errors Kevin Wolf
@ 2019-09-20 16:41   ` Tony Asleson
  2019-09-20 17:08     ` Eric Blake
  2019-09-20 18:11     ` Kevin Wolf
  0 siblings, 2 replies; 30+ messages in thread
From: Tony Asleson @ 2019-09-20 16:41 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel

On 9/20/19 3:36 AM, Kevin Wolf wrote:
> I/O error inserted by blkdebug can be one-off or permanent, but since it
> also supports using a small state machine, I think you should already be
> able to configure your errors that are corrected by a rewrite, too, even
> if there is no explicit support for this yet (I guess we could add it if
> it turned out to be much easier to use).

One thing I thought about is the feasibility of having a callback for
these errors across qapi.  For example you could register a sector for a
read/write/both and when that operation occurs you would block IO, send
the sector number and associated data across qapi for test code to do
something with it and respond allowing the operation to continue
successfully or by returning an error determined by the external test
code to be propagated to guest.

This would allow the logic to be outside of QEMU.  So for example in the
re-write case the test code could remove the error when it gets the
write, instead of having that logic embedded in QEMU itself.

Thoughts?

> The one thing I see in your series that we can't currently provide this
> way is the exact sector number where the error happened. If you read
> from sector 32 to 64 and there is an error configured for sector 50, you
> just see that the whole request is failing.

Also depending on the device type the data behavior can be different
too.  For SCSI devices I believe the specification states that the data
leading up to the sector in error is transferred to the initiator.  For
ATA I believe this is not true.  My code doesn't model this correctly.
I generated the error before any data was transferred.

I'm thinking changes in blkdebug will need to be done to handle this too?

> I also wonder why you had to write low-level error handling code instead
> of calling the existing error functions. If the existing functions don't
> give the right result in error cases, shouldn't they be fixed anyway?

I would think so too.  I'm using error constants that already exist, but
apparently are not being used anywhere else.


> And then, as John already hinted, adding code for debugging scenarios to
> hot paths that are important for high-performance VMs that don't use any
> debugging is less than optimal.

I agree, the POC code was experimental, but I should have done more
effort in minimizing the run-time costs.

Additionally I think it would be good if QEMU could standardize the
device wwn format to be consistent throughout all block device types,
eg. uint64_t, but maybe not possible.  I also think it would be good to
allow the wwn passed on the command line correlate with what the guest
sees for /sys/block/<device>/device/wwid.

However, I'm assuming that QEMU has the same stance as the linux kernel
with no visible user space breakage?

> 
> So bringing everything together, what would you think of this plan:
> 
> 1. Extend blkdebug with whatever ways you need to trigger I/O errors
>    (only if the existing modes aren't sufficient at least for the start;
>    we can still always extend it later)
> 
> 2. Add a new BlockDriver callback that can return detailed information
>    about an error (such as the exact sector number), and wire it up
>    through BlockBackend (some blk_* function). Implement it in blkdebug.
> 
> 3. In the guest devices, only call the function to get detailed error
>    information in the existing error path. You can then update some
>    device state according to the details if the block driver returned
>    anything (probably only blkdebug will return something).
> 
> This way, we have no changes at all in the hot I/O path if you didn't
> configure your VM with a blkdebug filter. And we avoid duplication of
> code both in the error handler in devices and in the error injection
> mechanisms.

This all sounds good to me.  Although I'm not 100% sure of all the
specific details you are describing at the moment as I'm not that
familiar with the code base.

Thanks!

-Tony





^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-09-20 16:41   ` Tony Asleson
@ 2019-09-20 17:08     ` Eric Blake
  2019-09-20 19:15       ` Tony Asleson
  2019-09-20 18:11     ` Kevin Wolf
  1 sibling, 1 reply; 30+ messages in thread
From: Eric Blake @ 2019-09-20 17:08 UTC (permalink / raw)
  To: tasleson, Kevin Wolf; +Cc: qemu-devel


[-- Attachment #1.1: Type: text/plain, Size: 2541 bytes --]

On 9/20/19 11:41 AM, Tony Asleson wrote:
> On 9/20/19 3:36 AM, Kevin Wolf wrote:
>> I/O error inserted by blkdebug can be one-off or permanent, but since it
>> also supports using a small state machine, I think you should already be
>> able to configure your errors that are corrected by a rewrite, too, even
>> if there is no explicit support for this yet (I guess we could add it if
>> it turned out to be much easier to use).
> 
> One thing I thought about is the feasibility of having a callback for
> these errors across qapi.  For example you could register a sector for a
> read/write/both and when that operation occurs you would block IO, send
> the sector number and associated data across qapi for test code to do
> something with it and respond allowing the operation to continue
> successfully or by returning an error determined by the external test
> code to be propagated to guest.
> 
> This would allow the logic to be outside of QEMU.  So for example in the
> re-write case the test code could remove the error when it gets the
> write, instead of having that logic embedded in QEMU itself.
> 
> Thoughts?

To some extent, this sounds similar to what you can accomplish with an
NBD disk.  You can write an nbdkit plugin which exposes whatever error
handling you want (such as "the first read to this sector fails with
EIO, but a second read succeeds"), but only insofar as it fits in the
bounds of what the NBD protocol exposes over the wire (so qemu would see
EIO errors, and could narrow in on which portion of the disk provides or
avoids those errors, but would not have any additional insights that
would resemble a hardware-specific query without extensions to the NBD
protocol).

I am worried, however, that making data transactions have to go through
QMP to get an answer on how to handle a specific guest request will slow
things down; QMP is not built to be an efficient dataplane interface.
If you truly want isolation (where another process receives all guest
transactions, and makes decisions on how to handle them), it seems like
writing up a remote server (as in 'nbdkit' for the NBD protocol, or a
custom provider for the iscsi protocol) is the way to go.

[I have no idea if there is an iscsi counterpart for nbdkit; the iscsi
protocol is notoriously more complex than the NBD protocol, so it's not
something I'm likely to write]


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 4/4] ahci media error reporting
  2019-09-19 21:49     ` Tony Asleson
@ 2019-09-20 17:22       ` John Snow
  0 siblings, 0 replies; 30+ messages in thread
From: John Snow @ 2019-09-20 17:22 UTC (permalink / raw)
  To: tasleson; +Cc: Kevin Wolf, qemu-devel, Qemu-block



On 9/19/19 5:49 PM, Tony Asleson wrote:
> On 9/19/19 3:43 PM, John Snow wrote:
>>
>>
>> On 9/19/19 3:48 PM, Tony Asleson wrote:
>>> Initial attempt at returning a media error for ahci.  This is certainly
>>> wrong and needs serious improvement.
>>>
>>
>> Hi; I have the unfortunate distinction of being the AHCI maintainer.
>> Please CC me on future revisions and discussion if you are interacting
>> with my problem child.
> 
> Will do and thank you for taking a look at this!
> 
>> Also remember to CC qemu-block.
>>
>>> Signed-off-by: Tony Asleson <tasleson@redhat.com>
>>> ---
>>>  hw/ide/ahci.c | 27 +++++++++++++++++++++++++++
>>>  1 file changed, 27 insertions(+)
>>>
>>> diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
>>> index d45393c019..f487764106 100644
>>> --- a/hw/ide/ahci.c
>>> +++ b/hw/ide/ahci.c
>>> @@ -36,6 +36,7 @@
>>>  #include "hw/ide/internal.h"
>>>  #include "hw/ide/pci.h"
>>>  #include "ahci_internal.h"
>>> +#include "block/error_inject.h"
>>>  
>>>  #include "trace.h"
>>>  
>>> @@ -999,6 +1000,22 @@ static void ncq_err(NCQTransferState *ncq_tfs)
>>>      ncq_tfs->used = 0;
>>>  }
>>>  
>>> +/*
>>> + * Figure out correct way to report media error, this is at best a guess
>>> + * and based on the output of linux kernel, not even remotely close.
>>> + */
>>
>> Depends on what kind of error, exactly, you're trying to report, and at
>> what level. (AHCI, NCQ, SATA, ATA?)
> 
> I was trying to return a media error, like a 3/1101 for SCSI device.  I
> think that is at the ATA level?
> 
> 
>> Keep in mind that you've inserted an error path for SATA drives using
>> NCQ, but seemingly haven't touched SATA drives not using NCQ, or ATAPI
>> devices using either PATA/SATA, or ATA drives on PATA.
> 
> Correct, but for trying out a simple read on a SATA drive for Linux I
> end up here first :-)  Well, until the kernel retries a number of times
> and finally gives up and returns an error while also disabling NCQ for
> the device.
> 
> 
>>> +static void ncq_media_err(NCQTransferState *ncq_tfs, uint64_t err_sector)
>>> +{
>>> +    IDEState *ide_state = &ncq_tfs->drive->port.ifs[0];
>>> +
>>> +    ide_state->error = ECC_ERR;
>>> +    ide_state->status = READY_STAT | ERR_STAT;
>>> +    ncq_tfs->drive->port_regs.scr_err |= (1 << ncq_tfs->tag);
>>> +    ncq_tfs->lba = err_sector;
>>> +    qemu_sglist_destroy(&ncq_tfs->sglist);
>>> +    ncq_tfs->used = 0;
>>> +}
>>> +
>>
>> If you are definitely very sure you only want an ide_state error
>> difference, you could just as well call ncq_err and then patch
>> ide_state->error.
>>
>> (I am not sure that's what you want, but... maybe it is?)
> 
> As I mentioned above, return an unrecoverable media error.
> 
> SCSI sense data can report the first sector in error and I thought I
> could do the same for SATA too?
> 
>> I'd have to check -- because I can't say the AHCI emulator was designed
>> so much as congealed -- but you might need calls to ncq_finish.
>>
>> usually, ncq_cb handles the return from any NCQ command and will call
>> ncq_err and ncq_finish as appropriate to tidy up the command.
>>
>> It might be a mistake that execute_ncq_command issues ncq_err in the
>> `default` arm of the switch statement without a call to finish.
>>
>> If we do call ncq_finish from this context I'm not sure if we want
>> block_acct_done here unconditionally. We may not have started a block
>> accounting operation if we never started a backend operation. Everything
>> else looks about right to me.
>>
>>
>>>  static void ncq_finish(NCQTransferState *ncq_tfs)
>>>  {
>>>      /* If we didn't error out, set our finished bit. Errored commands
>>> @@ -1065,6 +1082,8 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>>>  {
>>>      AHCIDevice *ad = ncq_tfs->drive;
>>>      IDEState *ide_state = &ad->port.ifs[0];
>>> +    uint64_t error_sector = 0;
>>> +    char device_id[32];
>>>      int port = ad->port_no;
>>>  
>>>      g_assert(is_ncq(ncq_tfs->cmd));
>>> @@ -1072,6 +1091,14 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>>>  
>>>      switch (ncq_tfs->cmd) {
>>>      case READ_FPDMA_QUEUED:
>>> +        sprintf(device_id, "%lu", ide_state->wwn);
>>
>> This seems suspicious for your design in general, but I'd like to not
>> run sprintf to a buffer in the hotpath for NCQ.
> 
> I totally agree.
> 
> I started out using integers in the call for error_in_read, as that is
> what SCSI uses internally for wwn.  Then I did NVMe and it's using a
> string that doesn't apparently need to be an integer for the wwn? so I
> changed it to being a string to accommodate.
> 
> I also find it interesting that when a SATA device wwid is dumped out
> within the guest it doesn't appear to have any correlation with the wwn
> that was passed in on the command line, eg.
> 
> -device ide-drive,drive=satadisk,bus=ahci.0,wwn=8675309
> 
> $cat /sys/block/sda/device/wwid
> t10.ATA     QEMU HARDDISK                           QM00005
> 
> 
> This does correlate for SCSI
> 
> -device scsi-hd,drive=hd,wwn=12345678
> 
> $ cat /sys/block/sdc/device/wwid
> naa.0000000000bc614e
> 
> 
> as 0xbc614e = 12345678
> 
> 
> For NVMe, the wwn is in the wwid, but it's not immediately obvious.
> 
> Being able to correlate between the command line and what you find in
> the guest would be good.
> 
> 
>> If you need this, I'd do it when wwn is set and just grab it from the
>> state structure.
>>
>>> +
>>> +        if (error_in_read(device_id, ncq_tfs->lba,
>>> +                ncq_tfs->sector_count, &error_sector)) {
>>> +            ncq_media_err(ncq_tfs, error_sector);
>>> +            return;
>>> +        }
>>> +
>>
>> One of the downsides to trying to trigger read error injections
>> per-device instead of per-node is that now you have to goof around with
>> device-specific code everywhere>
>> I suppose from your cover letter you *WANT* device-specific error
>> exercising, which would necessitate a different design from blkdebug,
>> but it means you have to add support for the framework per-device and it
>> might be very tricky to get right.
> 
> Yes, goal was to be able to selectively pick one or more specific block
> devices and then create one or more block errors on each device with
> potentially different error behavior for each block in error.
> 

Yeah -- I imagine you want to see how the kernel will respond to various
error situations posed by certain block devices themselves.

Well... In general, we support rerror=stop and werror=stop in all of our
block emulators that we care about, and you can find the choke points
that implement this by looking for e.g. 'BLOCK_ERROR_ACTION_STOP' --
these error-handling points should already be kind to the various
emulator state machines; the emulators know how to manage the error
actions correctly from this point.

I think whatever solution you end up on should look to expand these
choke points instead of trying to add new ones.

... maybe. maybe it's still too hard to weave together in that way, I'm
not sure. I only care about IDE/ATA devices, so it's just a
recommendation to try to use the existing error handling points as a
basis instead of trying to add new ones.

I think adding new ones will be hard to verify for correctness and hard
to test.

--js

> 
>>>          trace_execute_ncq_command_read(ad->hba, port, ncq_tfs->tag,
>>>                                         ncq_tfs->sector_count, ncq_tfs->lba);
>>>          dma_acct_start(ide_state->blk, &ncq_tfs->acct,
>>>
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-09-20  9:22 ` Stefan Hajnoczi
@ 2019-09-20 17:28   ` Tony Asleson
  2019-11-14 15:47     ` Tony Asleson
  0 siblings, 1 reply; 30+ messages in thread
From: Tony Asleson @ 2019-09-20 17:28 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel

On 9/20/19 4:22 AM, Stefan Hajnoczi wrote:
> blkdebug is purely at the QEMU block layer level.  It is not aware of
> storage controller-specific error information or features.  If you want
> to inject NVMe- or SCSI-specific errors that make no sense in QEMU's
> block layer, then trying to do it in blkdebug becomes a layering
> violation.  This justifies adding a new error injection feature directly
> into AHCI, virtio-scsi, NVMe, etc devices.

Good discussion point...

In my opening use case for this POC I'm generically trying to create an
unrecoverable media error for a specific sector.  For each of the
different device types it's different on how that error is conveyed and
the associated data in transfer.

If we return EIO on a read_aio, that must be translated into transport
specific error right?  I really need to try out blkdebug and see what is
seen in the guest.  Maybe I'm upside down on the layering too :-)

> What I like about blkdebug is the state machine and relatively powerful
> tests that this enables.  It makes sense to reuse those for storage
> controller-specific error injection too.  In the future we may wish to
> reuse it for network cards and other emulated devices too!
> 
> Perhaps the error injection "engine" (the core blkdebug code) can be
> extracted and reused?

I think VMs are a great way to test all different error paths, just not
those specific to storage so if we could make this generic that would be
great.


I also elaborated about this a bit in one of my other responses, but
I'll reiterate here a bit too.  If we could design a suitable callback
interface utilizing qapi I think we could move the state machine/logic
out of QEMU all together.  So that the test code could contain this
logic which would allow all types of behavior that we haven't even
thought of to exist outside of QEMU and not require changes to QEMU to
exploit.

-Tony


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-09-20 16:41   ` Tony Asleson
  2019-09-20 17:08     ` Eric Blake
@ 2019-09-20 18:11     ` Kevin Wolf
  2019-09-20 18:55       ` Tony Asleson
  1 sibling, 1 reply; 30+ messages in thread
From: Kevin Wolf @ 2019-09-20 18:11 UTC (permalink / raw)
  To: Tony Asleson; +Cc: qemu-devel

Am 20.09.2019 um 18:41 hat Tony Asleson geschrieben:
> On 9/20/19 3:36 AM, Kevin Wolf wrote:
> > I/O error inserted by blkdebug can be one-off or permanent, but since it
> > also supports using a small state machine, I think you should already be
> > able to configure your errors that are corrected by a rewrite, too, even
> > if there is no explicit support for this yet (I guess we could add it if
> > it turned out to be much easier to use).
> 
> One thing I thought about is the feasibility of having a callback for
> these errors across qapi.  For example you could register a sector for a
> read/write/both and when that operation occurs you would block IO, send
> the sector number and associated data across qapi for test code to do
> something with it and respond allowing the operation to continue
> successfully or by returning an error determined by the external test
> code to be propagated to guest.
> 
> This would allow the logic to be outside of QEMU.  So for example in the
> re-write case the test code could remove the error when it gets the
> write, instead of having that logic embedded in QEMU itself.
> 
> Thoughts?

Emitting a QMP event when blkdebug injects an error makes sense to me.

I wouldn't use it for this case, though, because this would become racy.
It could happen that the guest writes to the image, which sends a QMP
event, and then reads before the external program has removed the error.

> > The one thing I see in your series that we can't currently provide this
> > way is the exact sector number where the error happened. If you read
> > from sector 32 to 64 and there is an error configured for sector 50, you
> > just see that the whole request is failing.
> 
> Also depending on the device type the data behavior can be different
> too.  For SCSI devices I believe the specification states that the data
> leading up to the sector in error is transferred to the initiator.  For
> ATA I believe this is not true.  My code doesn't model this correctly.
> I generated the error before any data was transferred.
> 
> I'm thinking changes in blkdebug will need to be done to handle this too?

Hm... The SCSI case might get a bit tricky (if the spec does indeed say
that this is what must happen).

blkdebug can only return an I/O error for the whole request. Then the
device would ask for more information about the error and get the sector
number of the error. And then it would have to retry up to immediately
before the problematic sector.

By the way, this is an example where fixing this in the context of
blkdebug will also fix the behaviour for real I/O errors.

> > I also wonder why you had to write low-level error handling code instead
> > of calling the existing error functions. If the existing functions don't
> > give the right result in error cases, shouldn't they be fixed anyway?
> 
> I would think so too.  I'm using error constants that already exist, but
> apparently are not being used anywhere else.
> 
> 
> > And then, as John already hinted, adding code for debugging scenarios to
> > hot paths that are important for high-performance VMs that don't use any
> > debugging is less than optimal.
> 
> I agree, the POC code was experimental, but I should have done more
> effort in minimizing the run-time costs.
> 
> Additionally I think it would be good if QEMU could standardize the
> device wwn format to be consistent throughout all block device types,
> eg. uint64_t, but maybe not possible.  I also think it would be good to
> allow the wwn passed on the command line correlate with what the guest
> sees for /sys/block/<device>/device/wwid.
> 
> However, I'm assuming that QEMU has the same stance as the linux kernel
> with no visible user space breakage?

Changes that are visible to the guest break live migration, so they are
only allowed with a new option that defaults to off for existing machine
type. The default can change to on for new machine types.

> > So bringing everything together, what would you think of this plan:
> > 
> > 1. Extend blkdebug with whatever ways you need to trigger I/O errors
> >    (only if the existing modes aren't sufficient at least for the start;
> >    we can still always extend it later)
> > 
> > 2. Add a new BlockDriver callback that can return detailed information
> >    about an error (such as the exact sector number), and wire it up
> >    through BlockBackend (some blk_* function). Implement it in blkdebug.
> > 
> > 3. In the guest devices, only call the function to get detailed error
> >    information in the existing error path. You can then update some
> >    device state according to the details if the block driver returned
> >    anything (probably only blkdebug will return something).
> > 
> > This way, we have no changes at all in the hot I/O path if you didn't
> > configure your VM with a blkdebug filter. And we avoid duplication of
> > code both in the error handler in devices and in the error injection
> > mechanisms.
> 
> This all sounds good to me.  Although I'm not 100% sure of all the
> specific details you are describing at the moment as I'm not that
> familiar with the code base.

Yes, but I hope it does give you some pointers what to look at. Feel
free to ask more questions once you're ready. (Though I won't be
available next week, but there are other people, too, and I'll read it
later anyway.)

Kevin


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-09-20 18:11     ` Kevin Wolf
@ 2019-09-20 18:55       ` Tony Asleson
  2019-09-30 14:54         ` Kevin Wolf
  0 siblings, 1 reply; 30+ messages in thread
From: Tony Asleson @ 2019-09-20 18:55 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel

On 9/20/19 1:11 PM, Kevin Wolf wrote:
> Emitting a QMP event when blkdebug injects an error makes sense to me.
> 
> I wouldn't use it for this case, though, because this would become racy.
> It could happen that the guest writes to the image, which sends a QMP
> event, and then reads before the external program has removed the error.

My POC had a single lock protecting it's shared state.  I'm kind of
surprised no one jumped on that because it's a big point of lock
contention.  One could argue that the state data and associated lock(s)
should be on each device which leads me to the next point.  I think with
careful locking and sequencing we could address this race condition so
that the error was removed before the write completed.  In fact it would
need to work that way to allow the external test code the ability to
perturb the data before it's written if that's what the test wanted to do.

-Tony


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-09-20 17:08     ` Eric Blake
@ 2019-09-20 19:15       ` Tony Asleson
  0 siblings, 0 replies; 30+ messages in thread
From: Tony Asleson @ 2019-09-20 19:15 UTC (permalink / raw)
  To: Eric Blake, Kevin Wolf; +Cc: qemu-devel

On 9/20/19 12:08 PM, Eric Blake wrote:
> I am worried, however, that making data transactions have to go through
> QMP to get an answer on how to handle a specific guest request will slow
> things down; QMP is not built to be an efficient dataplane interface.

IMHO we only need to make the code path efficient up to the point we
know we have an IO operation that contains sector(s) that we want to do
something with.  Once that happens I don't think it will be an issue for
those to be slower.  A real spinning disk drive takes longer to handle a
request when it encounters an error too.

> If you truly want isolation (where another process receives all guest
> transactions, and makes decisions on how to handle them)

It would only handle specific sector(s) of interest, not all.

-Tony


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 4/4] ahci media error reporting
  2019-09-20 16:18       ` John Snow
@ 2019-09-20 19:25         ` Tony Asleson
  2019-09-20 19:29           ` John Snow
  0 siblings, 1 reply; 30+ messages in thread
From: Tony Asleson @ 2019-09-20 19:25 UTC (permalink / raw)
  To: John Snow, Kevin Wolf; +Cc: qemu-devel, Qemu-block

On 9/20/19 11:18 AM, John Snow wrote:
> For sure -- I handle the normative cases, but I don't test what happens
> if you issue an unsupported NCQ command. (I don't know what real
> hardware does right now, either. I'm sure I could read the spec and find
> out, but don't have a testing setup that lets me analyze real hardware
> anymore.)

Regardless of what actual hardware does, it's always good to see what
the spec says as hardware folks get it wrong too sometimes :-)

-Tony


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 4/4] ahci media error reporting
  2019-09-20 19:25         ` Tony Asleson
@ 2019-09-20 19:29           ` John Snow
  0 siblings, 0 replies; 30+ messages in thread
From: John Snow @ 2019-09-20 19:29 UTC (permalink / raw)
  To: tasleson, Kevin Wolf; +Cc: qemu-devel, Qemu-block



On 9/20/19 3:25 PM, Tony Asleson wrote:
> On 9/20/19 11:18 AM, John Snow wrote:
>> For sure -- I handle the normative cases, but I don't test what happens
>> if you issue an unsupported NCQ command. (I don't know what real
>> hardware does right now, either. I'm sure I could read the spec and find
>> out, but don't have a testing setup that lets me analyze real hardware
>> anymore.)
> 
> Regardless of what actual hardware does, it's always good to see what
> the spec says as hardware folks get it wrong too sometimes :-)
> 
> -Tony
> 

That depends. If we're emulating an "AHCI device", we should follow the
spec, but the one we instantiate in QEMU is the ICH9 version, so we try
to follow the hardware.

I do wind up looking at both, but because the relevant specs are split
across SATA, AHCI, ATA, ATAPI and into SCSI sometimes, (plus sub-specs)
it can be hard to piece together a holistic picture of what should
happen sometimes.

Looking at hardware is sometimes quicker and more definitive :)

Anyway, none of this helps you out much, but I will at least stay tuned
to review AHCI code and can try to help you make better incisions.

Enjoy your weekend!

--js


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-09-20 18:55       ` Tony Asleson
@ 2019-09-30 14:54         ` Kevin Wolf
  0 siblings, 0 replies; 30+ messages in thread
From: Kevin Wolf @ 2019-09-30 14:54 UTC (permalink / raw)
  To: Tony Asleson; +Cc: qemu-devel

Am 20.09.2019 um 20:55 hat Tony Asleson geschrieben:
> On 9/20/19 1:11 PM, Kevin Wolf wrote:
> > Emitting a QMP event when blkdebug injects an error makes sense to me.
> > 
> > I wouldn't use it for this case, though, because this would become racy.
> > It could happen that the guest writes to the image, which sends a QMP
> > event, and then reads before the external program has removed the error.
> 
> My POC had a single lock protecting it's shared state.  I'm kind of
> surprised no one jumped on that because it's a big point of lock
> contention.

I think people didn't review the code in detail because we're still
discussing very high-level design questions.

Anyway, I did mention that I'd like to get your code out of the way for
the fast path when the feature isn't used. If the user explicitly
enabled the feature and we're basically in a debugging setup, the lock
contention should be acceptable. In fact, the mutex might not even be
necessary because the code should be covered by the AioContext lock.

However, I don't see how this locking could fix the race I mention. It's
not a race between two QEMU components, but between the guest and a QMP
client. A mutex in QEMU certainly feels like the wrong way to address
it.

If you really wanted an external process to control this, you would have
to fully stop the VM whenever an error is injected and only continue it
via QMP after the QMP client has decided whether or not to disable the
error. Because you'd need a custom QMP client then, you wouldn't be able
to use things like libvirt for such QEMU instances.

Kevin


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-09-20 17:28   ` Tony Asleson
@ 2019-11-14 15:47     ` Tony Asleson
  2019-11-21 10:30       ` Stefan Hajnoczi
  0 siblings, 1 reply; 30+ messages in thread
From: Tony Asleson @ 2019-11-14 15:47 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel

On 9/20/19 12:28 PM, Tony Asleson wrote:
> On 9/20/19 4:22 AM, Stefan Hajnoczi wrote:
>> blkdebug is purely at the QEMU block layer level.  It is not aware of
>> storage controller-specific error information or features.  If you want
>> to inject NVMe- or SCSI-specific errors that make no sense in QEMU's
>> block layer, then trying to do it in blkdebug becomes a layering
>> violation.  This justifies adding a new error injection feature directly
>> into AHCI, virtio-scsi, NVMe, etc devices.
> 
> Good discussion point...
> 
> In my opening use case for this POC I'm generically trying to create an
> unrecoverable media error for a specific sector.  For each of the
> different device types it's different on how that error is conveyed and
> the associated data in transfer.
> 

I would like to get some additional clarification on this point.  Should
I be investing more time integrating my proposed functionality into
blkdebug or other?

Sorry for the long response time, got sidetracked with other stuff.

Thanks,
Tony



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-11-14 15:47     ` Tony Asleson
@ 2019-11-21 10:30       ` Stefan Hajnoczi
  2019-11-21 11:12         ` Kevin Wolf
  2019-11-26 18:19         ` Tony Asleson
  0 siblings, 2 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-11-21 10:30 UTC (permalink / raw)
  To: Tony Asleson; +Cc: kwolf, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1302 bytes --]

On Thu, Nov 14, 2019 at 09:47:48AM -0600, Tony Asleson wrote:
> On 9/20/19 12:28 PM, Tony Asleson wrote:
> > On 9/20/19 4:22 AM, Stefan Hajnoczi wrote:
> >> blkdebug is purely at the QEMU block layer level.  It is not aware of
> >> storage controller-specific error information or features.  If you want
> >> to inject NVMe- or SCSI-specific errors that make no sense in QEMU's
> >> block layer, then trying to do it in blkdebug becomes a layering
> >> violation.  This justifies adding a new error injection feature directly
> >> into AHCI, virtio-scsi, NVMe, etc devices.
> > 
> > Good discussion point...
> > 
> > In my opening use case for this POC I'm generically trying to create an
> > unrecoverable media error for a specific sector.  For each of the
> > different device types it's different on how that error is conveyed and
> > the associated data in transfer.
> > 
> 
> I would like to get some additional clarification on this point.  Should
> I be investing more time integrating my proposed functionality into
> blkdebug or other?
> 
> Sorry for the long response time, got sidetracked with other stuff.

blkdebug can inject EIO when a specific LBA is accessed.  Is that
enough for what you want to do?  Then you can reuse and maybe extend
blkdebug.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-11-21 10:30       ` Stefan Hajnoczi
@ 2019-11-21 11:12         ` Kevin Wolf
  2019-11-26 18:19         ` Tony Asleson
  1 sibling, 0 replies; 30+ messages in thread
From: Kevin Wolf @ 2019-11-21 11:12 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Tony Asleson, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1705 bytes --]

Am 21.11.2019 um 11:30 hat Stefan Hajnoczi geschrieben:
> On Thu, Nov 14, 2019 at 09:47:48AM -0600, Tony Asleson wrote:
> > On 9/20/19 12:28 PM, Tony Asleson wrote:
> > > On 9/20/19 4:22 AM, Stefan Hajnoczi wrote:
> > >> blkdebug is purely at the QEMU block layer level.  It is not aware of
> > >> storage controller-specific error information or features.  If you want
> > >> to inject NVMe- or SCSI-specific errors that make no sense in QEMU's
> > >> block layer, then trying to do it in blkdebug becomes a layering
> > >> violation.  This justifies adding a new error injection feature directly
> > >> into AHCI, virtio-scsi, NVMe, etc devices.
> > > 
> > > Good discussion point...
> > > 
> > > In my opening use case for this POC I'm generically trying to create an
> > > unrecoverable media error for a specific sector.  For each of the
> > > different device types it's different on how that error is conveyed and
> > > the associated data in transfer.
> > > 
> > 
> > I would like to get some additional clarification on this point.  Should
> > I be investing more time integrating my proposed functionality into
> > blkdebug or other?
> > 
> > Sorry for the long response time, got sidetracked with other stuff.
> 
> blkdebug can inject EIO when a specific LBA is accessed.  Is that
> enough for what you want to do?  Then you can reuse and maybe extend
> blkdebug.

And if we need something more device specific, maybe a common core can
be extracted that can be used by both blkdebug and the devices. All of
the logic of managing error injection rules and checking whether
something needs to be done at the current event should be common between
both.

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-11-21 10:30       ` Stefan Hajnoczi
  2019-11-21 11:12         ` Kevin Wolf
@ 2019-11-26 18:19         ` Tony Asleson
  2019-11-26 19:28           ` Kevin Wolf
  1 sibling, 1 reply; 30+ messages in thread
From: Tony Asleson @ 2019-11-26 18:19 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel

On 11/21/19 4:30 AM, Stefan Hajnoczi wrote:
> blkdebug can inject EIO when a specific LBA is accessed.  Is that
> enough for what you want to do?  Then you can reuse and maybe extend
> blkdebug.

Not exactly.  For SCSI, I would like to be able to return different
types of device errors on reads eg. 03/1101, 03/1600 and writes.  The
SCSI sense data needs to include the first block in error for the
transfer.  It would be good to also have the ability to include things
like SCSI check conditions with recoverable errors too.

I've been experimenting with blkdebug, to learn more and to see how it
would need to be extended.  One thing that I was trying to understand is
how an EIO from blkdebug gets translated into a bus/device specific
error.  At the moment I'm not sure.  I've been trying to figure out the
layering.  I think that blkdebug sits between the device specific model
and the underlying block representation on disk.  Thus it injects error
return values when accessing the underlying data, but that could be
incorrect.  If it is correct I should see some code that translates the
EIO to something transport/device specific.  Although I don't understand
how returning an ENOSPC from read_aio in blkdebug would get translated
for a SCSI disk as it doesn't make sense to me (one of the examples in
the documentation).  Actually I don't know how getting ENOSPC on a read
could happen?

During my blkdebug experimentation, I've been using lsi53c895a  with
scsi-disk and thus far I've not been able to generate a read error back
to the guest kernel.  I've managed to abort qemu with an assert and hang
qemu without being able to get an error back to the guest kernel.  I
wrote up one of them: https://bugs.launchpad.net/qemu/+bug/1853898 .
Specifying a specific sector hasn't worked for me yet.  I'm still trying
to figure out how to enable tracing/debugging etc. to see what I'm going
incorrectly.

If someone can point me to any relevant docs, diagrams, talks etc. that
would be greatly appreciated.  I've been looking in the source tree docs
directory and the source code itself and things I've found from web
searches.

Thanks,
Tony




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 0/4] POC: Generating realistic block errors
  2019-11-26 18:19         ` Tony Asleson
@ 2019-11-26 19:28           ` Kevin Wolf
  0 siblings, 0 replies; 30+ messages in thread
From: Kevin Wolf @ 2019-11-26 19:28 UTC (permalink / raw)
  To: Tony Asleson; +Cc: Stefan Hajnoczi, qemu-devel

Am 26.11.2019 um 19:19 hat Tony Asleson geschrieben:
> On 11/21/19 4:30 AM, Stefan Hajnoczi wrote:
> > blkdebug can inject EIO when a specific LBA is accessed.  Is that
> > enough for what you want to do?  Then you can reuse and maybe extend
> > blkdebug.
> 
> Not exactly.  For SCSI, I would like to be able to return different
> types of device errors on reads eg. 03/1101, 03/1600 and writes.  The
> SCSI sense data needs to include the first block in error for the
> transfer.  It would be good to also have the ability to include things
> like SCSI check conditions with recoverable errors too.
> 
> I've been experimenting with blkdebug, to learn more and to see how it
> would need to be extended.  One thing that I was trying to understand is
> how an EIO from blkdebug gets translated into a bus/device specific
> error.  At the moment I'm not sure.  I've been trying to figure out the
> layering.  I think that blkdebug sits between the device specific model
> and the underlying block representation on disk.  Thus it injects error
> return values when accessing the underlying data, but that could be
> incorrect.  If it is correct I should see some code that translates the
> EIO to something transport/device specific.

The point where the device calls into the generic block layer is where
the functions that start with blk_ are called (blk_aio_pwritev() and
blk_aio_preadv() are probably the most interesting ones).

The callback path in scsi-disk is not that easy to follow, but in the
end, error returns should result in scsi_handle_rw_error() being called
where error codes are translated into SCSI sense codes.

> Although I don't understand how returning an ENOSPC from read_aio in
> blkdebug would get translated for a SCSI disk as it doesn't make sense
> to me (one of the examples in the documentation).  Actually I don't
> know how getting ENOSPC on a read could happen?

That scenario doesn't make a lot of sense to me either, but blkdebug can
just inject any error code, even nonsensical ones.

> During my blkdebug experimentation, I've been using lsi53c895a  with
> scsi-disk and thus far I've not been able to generate a read error back
> to the guest kernel.  I've managed to abort qemu with an assert and hang
> qemu without being able to get an error back to the guest kernel.  I
> wrote up one of them: https://bugs.launchpad.net/qemu/+bug/1853898 .
> Specifying a specific sector hasn't worked for me yet.  I'm still trying
> to figure out how to enable tracing/debugging etc. to see what I'm going
> incorrectly.

Note that depending on the rerror/werror options, QEMU may not deliver
errors to the guest, but stop VMs instead. If the monitor is still
responsive, it's likely that you just got a stopped VM rather than a
hanging QEMU.

The default is that the VM is stopped for ENOSPC and other errors are
delivered to the guest.

Kevin



^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-11-26 19:42 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-19 19:48 [RFC 0/4] POC: Generating realistic block errors Tony Asleson
2019-09-19 19:48 ` [RFC 1/4] Add qapi for block error injection Tony Asleson
2019-09-20  9:03   ` Philippe Mathieu-Daudé
2019-09-20 15:17     ` Tony Asleson
2019-09-19 19:48 ` [RFC 2/4] SCSI media error reporting Tony Asleson
2019-09-19 19:48 ` [RFC 3/4] NVMe " Tony Asleson
2019-09-19 19:48 ` [RFC 4/4] ahci " Tony Asleson
2019-09-19 20:43   ` John Snow
2019-09-19 21:49     ` Tony Asleson
2019-09-20 17:22       ` John Snow
2019-09-20  8:43     ` Kevin Wolf
2019-09-20 16:18       ` John Snow
2019-09-20 19:25         ` Tony Asleson
2019-09-20 19:29           ` John Snow
2019-09-20  8:36 ` [RFC 0/4] POC: Generating realistic block errors Kevin Wolf
2019-09-20 16:41   ` Tony Asleson
2019-09-20 17:08     ` Eric Blake
2019-09-20 19:15       ` Tony Asleson
2019-09-20 18:11     ` Kevin Wolf
2019-09-20 18:55       ` Tony Asleson
2019-09-30 14:54         ` Kevin Wolf
2019-09-20  9:22 ` Stefan Hajnoczi
2019-09-20 17:28   ` Tony Asleson
2019-11-14 15:47     ` Tony Asleson
2019-11-21 10:30       ` Stefan Hajnoczi
2019-11-21 11:12         ` Kevin Wolf
2019-11-26 18:19         ` Tony Asleson
2019-11-26 19:28           ` Kevin Wolf
2019-09-20  9:25 ` Stefan Hajnoczi
2019-09-20 14:41 ` no-reply

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).