[RFC 0/4] POC: Generating realistic block errors

From: Tony Asleson @ 2019-09-19 19:48 UTC
  To: qemu-devel, kwolf

For a long time I have thought that VMs could be a great way to improve OS
code quality: modify the simulated hardware to return errors and thereby
exercise the error paths in the OS, specifically in block devices for now.
A number of different approaches are available within the Linux kernel, e.g.
scsi_debug, dm-flakey, and others.  However, I have always thought it would be
best to simulate the errors at the hardware level, so that the entire stack is
exercised.  As a bonus it is OS agnostic, which helps projects that span
multiple OSs, and it is available before the OS even boots.
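
To make the idea concrete, here is a minimal sketch of the kind of check a
device model could run before completing a guest read.  It is not the code in
this series: BadSector, inject_check_read() and the once/armed flags are
names made up purely for illustration, and a real implementation would be
keyed per device and driven from the management interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one guest-visible sector that should fail. */
typedef struct BadSector {
    uint64_t lba;    /* logical block address that should report an error */
    bool     once;   /* disarm the entry after it has fired once */
    bool     armed;  /* entry is currently active */
} BadSector;

/*
 * Check whether the read [lba, lba + nblocks) touches an armed entry.
 * On a hit, store the lowest failing LBA in *bad_lba and return true.
 */
static bool inject_check_read(BadSector *tbl, size_t n,
                              uint64_t lba, uint32_t nblocks,
                              uint64_t *bad_lba)
{
    bool hit = false;
    size_t hit_idx = 0;

    assert(tbl != NULL && bad_lba != NULL);

    for (size_t i = 0; i < n; i++) {
        if (!tbl[i].armed) {
            continue;
        }
        /* Range test written so that lba + nblocks cannot overflow. */
        if (tbl[i].lba >= lba && tbl[i].lba - lba < nblocks) {
            if (!hit || tbl[i].lba < tbl[hit_idx].lba) {
                hit = true;
                hit_idx = i;
            }
        }
    }

    if (hit) {
        *bad_lba = tbl[hit_idx].lba;
        if (tbl[hit_idx].once) {
            tbl[hit_idx].armed = false;  /* one-shot error */
        }
    }
    return hit;
}
```

Reporting the lowest failing LBA mirrors what a real drive would do when a
multi-sector read stops at the first unreadable block; the one-shot flag is
one possible way to model an error that goes away on retry.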

This POC needs a lot more work, but it's what I have so far.  Learning QEMU
internals, plus some of the different bus types, has been interesting to say
the least.  I'm most familiar with SCSI, but the others are foreign to me.
AHCI/SATA/ATA is very interesting with its history and the associated code to
handle its evolution.

Eventually I think it would be useful to add functionality for errors on
write paths, timeouts, and different error behaviors, such as media error(s)
that are corrected by a re-write (simulating a failed write on power loss),
bit rot injection, etc.  I know a number of these can be solved in different
ways, but handling them from the VM environment seems ideal to me.  Expanding
to gather statistics on IO patterns, histograms etc. could be very
useful too.  Having the ability to start/stop information collection and
then have access to what happened and in what order could allow for
better error injection of key FS structures and software RAID solutions too.

I've recently been informed that blkdebug exists.  From a cursory investigation
it does appear to have overlap, but I haven't given it a try yet.  From looking
at the code to insert my changes, it appears that some of the errors I'm
generating are different from what, for example, an EIO on a read_aio does, but
I'm not sure.  Perhaps some of the other features that I've outlined above
already exist in some other part of QEMU?
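
For reference, this is roughly what a device-level media error looks like to
a SCSI guest: CHECK CONDITION with fixed-format sense data carrying sense key
MEDIUM ERROR (0x3), ASC/ASCQ 0x11/0x00 (unrecovered read error) and the
failing LBA in the information field.  The sketch below only shows the byte
layout; build_medium_error_sense() is a hypothetical helper name and is not
the function added by this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SENSE_FIXED_LEN             18
#define SENSE_KEY_MEDIUM_ERROR      0x03
#define ASC_UNRECOVERED_READ_ERROR  0x11

/*
 * Fill buf with fixed-format sense data for an unrecovered read error at
 * the given LBA.  Returns the number of bytes written, or 0 if buf is too
 * small.  The fixed format only has room for a 32-bit LBA; larger LBAs
 * would need descriptor-format sense data instead.
 */
static size_t build_medium_error_sense(uint8_t *buf, size_t len, uint32_t lba)
{
    assert(buf != NULL);

    if (len < SENSE_FIXED_LEN) {
        return 0;
    }
    memset(buf, 0, SENSE_FIXED_LEN);

    buf[0]  = 0x80 | 0x70;              /* VALID bit + current error, fixed */
    buf[2]  = SENSE_KEY_MEDIUM_ERROR;   /* sense key */
    buf[3]  = (uint8_t)(lba >> 24);     /* information field: failing LBA, */
    buf[4]  = (uint8_t)(lba >> 16);     /* stored big-endian               */
    buf[5]  = (uint8_t)(lba >> 8);
    buf[6]  = (uint8_t)lba;
    buf[7]  = SENSE_FIXED_LEN - 8;      /* additional sense length (10) */
    buf[12] = ASC_UNRECOVERED_READ_ERROR;
    buf[13] = 0x00;                     /* ASCQ */

    return SENSE_FIXED_LEN;
}
```

The information field is what lets the guest learn which block failed, which
is the kind of detail that matters when exercising error paths in filesystems
and software RAID.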

Instead of working on this more in a vacuum, I'm presenting what I have to
gauge interest and to see if others think it's as interesting as I do, or
perhaps to find out that I've been re-inventing the wheel.

I'm interested in learning what thoughts people have on this.

Thanks,
Tony

Tony Asleson (4):
  Add qapi for block error injection
  SCSI media error reporting
  NVMe media error reporting
  ahci media error reporting

 block/Makefile.objs  |   2 +-
 block/error_inject.c | 179 +++++++++++++++++++++++++++++++++++++++++++
 block/error_inject.h |  43 +++++++++++
 block/qapi.c         |  18 +++++
 hw/block/nvme.c      |   8 ++
 hw/ide/ahci.c        |  27 +++++++
 hw/scsi/scsi-disk.c  |  33 ++++++++
 include/scsi/utils.h |   4 +
 qapi/block.json      |  36 +++++++++
 scsi/utils.c         |  31 ++++++++
 10 files changed, 380 insertions(+), 1 deletion(-)
 create mode 100644 block/error_inject.c
 create mode 100644 block/error_inject.h

-- 
2.21.0
