qemu-devel.nongnu.org archive mirror
* [PATCH v4 00/25] block layer: split block APIs in global state and I/O
@ 2021-10-25 10:17 Emanuele Giuseppe Esposito
  2021-10-25 10:17 ` [PATCH v4 01/25] main-loop.h: introduce qemu_in_main_thread() Emanuele Giuseppe Esposito
                   ` (27 more replies)
  0 siblings, 28 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Currently, block layer APIs like block-backend.h contain a mix of
functions that either run in the main loop under the BQL, or are
thread-safe and run in iothreads performing I/O.
The functions running under the BQL also take care of modifying the
block graph, by using drain and/or aio_context_acquire/release.
This makes it very confusing to understand where each function
runs, and what assumptions it provides with regard to thread
safety.

We call the functions running under BQL "global state (GS) API", and
distinguish them from the thread-safe "I/O API".

The aim of this series is to split the relevant block headers into
global state and I/O sub-headers. The division will be done in
this way:
header.h will be split into header-global-state.h, header-io.h and
header-common.h. The latter will just contain the data structures
needed by header-global-state.h and header-io.h, and common helpers
that are neither in GS nor in I/O. header.h will remain for
legacy reasons and to avoid changing all includes in all QEMU C files,
but will only include the two new headers. No function shall be
added in header.c.
Once we split all relevant headers, it will be much easier to see what
uses the AioContext lock and remove it, which is the overall main
goal of this and other series that I posted/will post.
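
As a rough illustration of the layout described above, the single-file
sketch below marks where each header would begin. All names (ExampleNode,
example_node_ref, example_node_in_flight) are hypothetical and are not
part of this series; in the tree each section would be its own file, and
header.h would contain only #include lines.

```c
#include <assert.h>
#include <stddef.h>

/* ---- header-common.h: data structures shared by both APIs ---- */
#ifndef EXAMPLE_HEADER_COMMON_H
#define EXAMPLE_HEADER_COMMON_H
typedef struct ExampleNode {
    int refcnt;     /* touched only through the GS API in this sketch */
    int in_flight;  /* read through the I/O API in this sketch */
} ExampleNode;
#endif /* EXAMPLE_HEADER_COMMON_H */

/* ---- header-global-state.h: GS API, callers hold the BQL ---- */
#ifndef EXAMPLE_HEADER_GLOBAL_STATE_H
#define EXAMPLE_HEADER_GLOBAL_STATE_H
void example_node_ref(ExampleNode *node);
#endif /* EXAMPLE_HEADER_GLOBAL_STATE_H */

/* ---- header-io.h: I/O API, callable from any thread ---- */
#ifndef EXAMPLE_HEADER_IO_H
#define EXAMPLE_HEADER_IO_H
int example_node_in_flight(const ExampleNode *node);
#endif /* EXAMPLE_HEADER_IO_H */

/* ---- header.h: legacy umbrella, would only include the sub-headers ---- */

/* ---- implementation file (hypothetical) ---- */
void example_node_ref(ExampleNode *node)
{
    assert(node != NULL);
    node->refcnt++;
}

int example_node_in_flight(const ExampleNode *node)
{
    assert(node != NULL);
    return node->in_flight;
}
```

Existing callers would keep including header.h, while new code could
include only the sub-header it actually needs.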

In addition to splitting the relevant headers shown in this series,
it is also very helpful to split the function pointers in some
block structures, to understand what runs under the AioContext lock and
what doesn't. This is what patches 19-25 do.

Each function in the GS API will have an assertion, checking
that it is always running under BQL.
I/O functions are instead thread safe (or should be), meaning
that they *can* run under the BQL, but also in an iothread in another
AioContext. Therefore they do not contain any assertion, and
need to be audited manually to verify their correctness.
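
A minimal sketch of the pattern, assuming a simplified stand-in for
qemu_in_main_thread(): the flag example_bql_held, the ExampleGraph type
and both functions are hypothetical and only illustrate where the
assertion goes; they are not the code added by this series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the BQL state tracked by QEMU. */
static bool example_bql_held;

/* Hypothetical stand-in for qemu_in_main_thread(). */
static bool example_in_main_thread(void)
{
    return example_bql_held;
}

typedef struct ExampleGraph {
    int nodes;
} ExampleGraph;

/* GS API: modifies global state, so it asserts that the BQL is held. */
void example_gs_add_node(ExampleGraph *graph)
{
    assert(example_in_main_thread());
    graph->nodes++;
}

/*
 * I/O API: may run with or without the BQL, so there is no assertion.
 * Thread safety of such functions has to be audited by hand.
 */
int example_io_count_nodes(const ExampleGraph *graph)
{
    return graph->nodes;
}
```

Calling example_gs_add_node() while example_bql_held is false aborts in
the assertion, which is the kind of failure the series relies on to
catch misplaced callers.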

Adding assertions has already helped find 2 bugs, as shown in
my series "Migration: fix missing iothread locking".

Tested this series by running unit tests, qemu-iotests and qtests
(x86_64).
Some functions in the GS API are used everywhere but not
properly tested. Therefore their assertion is never actually run in
the tests, so despite my very careful auditing, it is not possible
to exclude that some will trigger while actually using QEMU.

Patch 1 introduces qemu_in_main_thread(), the function used in
all assertions. This had to be introduced because otherwise all unit
tests would fail, since they run in the main loop but use the code in
stubs/iothread-lock.c.
Patches 2-14 and 19-25 (with the exception of patch 8, which is an
additional assert) are all structured in the same way: first we split
the header and in the next patch we add assertions.
The rest of the patches contain either both assertions and split,
or have no assertions.

Next steps once this gets reviewed:
1) audit the GS API and replace the AioContext lock with drains,
or remove them when not necessary (requires further discussion).
2) [optional as it should be already the case] audit the I/O API
and check that thread safety is guaranteed

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
v3 -> v4:
* blockdev.h (patch 14): blockdev_mark_auto_del, blockdev_auto_del
  and blk_legacy_dinfo as GS API.
* add copyright header to block.h, block-io.h and block-global-state.h
* rebase on current master (c5b2f55981)

v2 -> v3:
* rename "graph API" into "global state API"
* change order of patches, block.h comes before block-backend.h
* change GS and I/O comment headers to avoid redundancy, all other
  headers refer to block-global-state.h and block-io.h
* fix typo on GS and I/O headers
* use assert instead of g_assert
* move bdrv_pwrite_sync, bdrv_block_status and bdrv_co_copy_range_{from/to}
  to the I/O API
* change assert_bdrv_graph_writable implementation, since we need
  to introduce additional drains
* remove transactions API split
* add preparation patch for blockdev.h (patch 13)
* backup-top -> copy-on-write
* change I/O comment in job.h into a better meaningful explanation
* fix all warnings given by checkpatch, mostly due to /* */ to be
  split in separate lines
* rebase on current master (c09124dcb8), and split the following new functions:
	blk_replace_bs (I/O)
	bdrv_bsc_is_data (I/O)
	bdrv_bsc_invalidate_range (I/O)
	bdrv_bsc_fill (I/O)
	bdrv_new_open_driver_opts (GS)
	blk_get_max_hw_iov (I/O)
  they are all added in patches 4 and 6.

v1 -> v2:
* remove the iothread locking bug fix, and send it as separate patch
* rename graph API -> global state API
* better documented patch 1 (qemu_in_main_thread)
* add and split all other block layer headers
* fix warnings given by checkpatch on multiline comments

Emanuele Giuseppe Esposito (25):
  main-loop.h: introduce qemu_in_main_thread()
  include/block/block: split header into I/O and global state API
  assertions for block global state API
  include/sysemu/block-backend: split header into I/O and global state
    (GS) API
  block/block-backend.c: assertions for block-backend
  include/block/block_int: split header into I/O and global state API
  assertions for block_int global state API
  block: introduce assert_bdrv_graph_writable
  include/block/blockjob_int.h: split header into I/O and GS API
  assertions for blockjob_int.h
  include/block/blockjob.h: global state API
  assertions for blockjob.h global state API
  include/sysemu/blockdev.h: move drive_add and inline drive_def
  include/sysemu/blockdev.h: global state API
  assertions for blockdev.h global state API
  include/block/snapshot: global state API + assertions
  block/copy-before-write.h: global state API + assertions
  block/coroutines: I/O API
  block_int-common.h: split function pointers in BlockDriver
  block_int-common.h: assertion in the callers of BlockDriver function
    pointers
  block_int-common.h: split function pointers in BdrvChildClass
  block_int-common.h: assertions in the callers of BdrvChildClass
    function pointers
  block-backend-common.h: split function pointers in BlockDevOps
  job.h: split function pointers in JobDriver
  job.h: assertions in the callers of JobDriver function pointers

 block.c                                     |  188 ++-
 block/backup.c                              |    1 +
 block/block-backend.c                       |  105 +-
 block/commit.c                              |    4 +
 block/copy-before-write.c                   |    2 +
 block/copy-before-write.h                   |    7 +
 block/coroutines.h                          |    6 +
 block/dirty-bitmap.c                        |    1 +
 block/io.c                                  |   37 +
 block/meson.build                           |    7 +-
 block/mirror.c                              |    4 +
 block/monitor/bitmap-qmp-cmds.c             |    6 +
 block/monitor/block-hmp-cmds.c              |    2 +-
 block/snapshot.c                            |   28 +
 block/stream.c                              |    2 +
 blockdev.c                                  |   55 +-
 blockjob.c                                  |   14 +
 include/block/block-common.h                |  389 +++++
 include/block/block-global-state.h          |  286 ++++
 include/block/block-io.h                    |  306 ++++
 include/block/block.h                       |  878 +----------
 include/block/block_int-common.h            | 1193 +++++++++++++++
 include/block/block_int-global-state.h      |  327 ++++
 include/block/block_int-io.h                |  163 ++
 include/block/block_int.h                   | 1478 +------------------
 include/block/blockjob.h                    |    9 +
 include/block/blockjob_int.h                |   28 +
 include/block/snapshot.h                    |   13 +-
 include/qemu/job.h                          |   16 +
 include/qemu/main-loop.h                    |   13 +
 include/sysemu/block-backend-common.h       |   92 ++
 include/sysemu/block-backend-global-state.h |  122 ++
 include/sysemu/block-backend-io.h           |  139 ++
 include/sysemu/block-backend.h              |  269 +---
 include/sysemu/blockdev.h                   |   24 +-
 job.c                                       |    9 +
 migration/savevm.c                          |    2 +
 softmmu/cpus.c                              |    5 +
 softmmu/qdev-monitor.c                      |    2 +
 softmmu/vl.c                                |   25 +-
 stubs/iothread-lock.c                       |    5 +
 41 files changed, 3619 insertions(+), 2643 deletions(-)
 create mode 100644 include/block/block-common.h
 create mode 100644 include/block/block-global-state.h
 create mode 100644 include/block/block-io.h
 create mode 100644 include/block/block_int-common.h
 create mode 100644 include/block/block_int-global-state.h
 create mode 100644 include/block/block_int-io.h
 create mode 100644 include/sysemu/block-backend-common.h
 create mode 100644 include/sysemu/block-backend-global-state.h
 create mode 100644 include/sysemu/block-backend-io.h

-- 
2.27.0




* [PATCH v4 01/25] main-loop.h: introduce qemu_in_main_thread()
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-10-25 11:33   ` Philippe Mathieu-Daudé
  2021-10-25 10:17 ` [PATCH v4 02/25] include/block/block: split header into I/O and global state API Emanuele Giuseppe Esposito
                   ` (26 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

When invoked from the main loop, this function is the same
as qemu_mutex_iothread_locked, and returns true if the BQL is held.
When invoked from iothreads or tests, it returns true only
if the current AioContext is the Main Loop.

This essentially just extends qemu_mutex_iothread_locked to work
also in unit tests or other users like storage-daemon, that run
in the Main Loop but end up using the implementation in
stubs/iothread-lock.c.

Using qemu_mutex_iothread_locked in unit tests always returns false
because they use the implementation in stubs/iothread-lock.c,
making all assertions added in the next patches fail even though the
AioContext is still the main loop.
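
The difference between the two checks can be modelled as follows. This
is only a sketch under simplifying assumptions: ExampleAioContext and
example_current_ctx are hypothetical, the current context would be
per-thread in a real program, and the two flavours are selected at link
time in QEMU rather than living side by side as they do here.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ExampleAioContext {
    const char *name;
} ExampleAioContext;

/* Hypothetical main loop context and an iothread context. */
static ExampleAioContext example_main_ctx = { "main-loop" };
static ExampleAioContext example_iothread_ctx = { "iothread0" };

/* Would be thread-local in a real program; a plain global here. */
static ExampleAioContext *example_current_ctx = &example_main_ctx;

/* Stub flavour of the old check: there is no BQL, so always false. */
bool example_mutex_iothread_locked_stub(void)
{
    return false;
}

/* Stub flavour of the new check: compare against the main context. */
bool example_in_main_thread_stub(void)
{
    return example_current_ctx == &example_main_ctx;
}
```

With the stub flavour, an assertion built on the old check would always
fail, while the new check passes as long as the caller runs in the main
loop context.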

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/qemu/main-loop.h | 13 +++++++++++++
 softmmu/cpus.c           |  5 +++++
 stubs/iothread-lock.c    |  5 +++++
 3 files changed, 23 insertions(+)

diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
index 8dbc6fcb89..6b8fa57c5d 100644
--- a/include/qemu/main-loop.h
+++ b/include/qemu/main-loop.h
@@ -245,6 +245,19 @@ AioContext *iohandler_get_aio_context(void);
  */
 bool qemu_mutex_iothread_locked(void);
 
+/**
+ * qemu_in_main_thread: same as qemu_mutex_iothread_locked when
+ * softmmu/cpus.c implementation is linked. Otherwise this function
+ * checks that the current AioContext is the global AioContext
+ * (main loop).
+ *
+ * This is useful when checking that the BQL is held, to avoid
+ * returning false when invoked by unit tests or other users like
+ * storage-daemon that end up using stubs/iothread-lock.c
+ * implementation.
+ */
+bool qemu_in_main_thread(void);
+
 /**
  * qemu_mutex_lock_iothread: Lock the main loop mutex.
  *
diff --git a/softmmu/cpus.c b/softmmu/cpus.c
index 071085f840..3f61a3c31d 100644
--- a/softmmu/cpus.c
+++ b/softmmu/cpus.c
@@ -481,6 +481,11 @@ bool qemu_mutex_iothread_locked(void)
     return iothread_locked;
 }
 
+bool qemu_in_main_thread(void)
+{
+    return qemu_mutex_iothread_locked();
+}
+
 /*
  * The BQL is taken from so many places that it is worth profiling the
  * callers directly, instead of funneling them all through a single function.
diff --git a/stubs/iothread-lock.c b/stubs/iothread-lock.c
index 5b45b7fc8b..ff7386e42c 100644
--- a/stubs/iothread-lock.c
+++ b/stubs/iothread-lock.c
@@ -6,6 +6,11 @@ bool qemu_mutex_iothread_locked(void)
     return false;
 }
 
+bool qemu_in_main_thread(void)
+{
+    return qemu_get_current_aio_context() == qemu_get_aio_context();
+}
+
 void qemu_mutex_lock_iothread_impl(const char *file, int line)
 {
 }
-- 
2.27.0




* [PATCH v4 02/25] include/block/block: split header into I/O and global state API
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
  2021-10-25 10:17 ` [PATCH v4 01/25] main-loop.h: introduce qemu_in_main_thread() Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-10-25 11:37   ` Philippe Mathieu-Daudé
                     ` (2 more replies)
  2021-10-25 10:17 ` [PATCH v4 03/25] assertions for block " Emanuele Giuseppe Esposito
                   ` (25 subsequent siblings)
  27 siblings, 3 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

block.h currently contains a mix of functions:
some of them run under the BQL and modify the block layer graph,
others are instead thread-safe and perform I/O in iothreads.
It is not easy to understand which function is part of which
group (I/O vs GS), and this patch aims to clarify it.

The "GS" functions need the BQL, and often use
aio_context_acquire/release and/or drain to be sure they
can modify the graph safely.
The I/O functions are instead thread safe, and can run in
any AioContext.

By splitting the header into two files, block-io.h
and block-global-state.h, we have a clearer view of what
needs what kind of protection. block-common.h
contains common structures shared by both headers.

block.h is left there for legacy reasons and to avoid changing
all includes in all C files that use the block APIs.

Assertions are added in the next patch.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block.c                            |   3 +
 block/meson.build                  |   7 +-
 include/block/block-common.h       | 389 +++++++++++++
 include/block/block-global-state.h | 286 ++++++++++
 include/block/block-io.h           | 306 ++++++++++
 include/block/block.h              | 878 +----------------------------
 6 files changed, 1012 insertions(+), 857 deletions(-)
 create mode 100644 include/block/block-common.h
 create mode 100644 include/block/block-global-state.h
 create mode 100644 include/block/block-io.h

diff --git a/block.c b/block.c
index 45f653a88b..6fdb4d7712 100644
--- a/block.c
+++ b/block.c
@@ -67,12 +67,15 @@
 
 #define NOT_DONE 0x7fffffff /* used while emulated sync operation in progress */
 
+/* Protected by BQL */
 static QTAILQ_HEAD(, BlockDriverState) graph_bdrv_states =
     QTAILQ_HEAD_INITIALIZER(graph_bdrv_states);
 
+/* Protected by BQL */
 static QTAILQ_HEAD(, BlockDriverState) all_bdrv_states =
     QTAILQ_HEAD_INITIALIZER(all_bdrv_states);
 
+/* Protected by BQL */
 static QLIST_HEAD(, BlockDriver) bdrv_drivers =
     QLIST_HEAD_INITIALIZER(bdrv_drivers);
 
diff --git a/block/meson.build b/block/meson.build
index deb73ca389..ee636e2754 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -113,8 +113,11 @@ block_ss.add(module_block_h)
 wrapper_py = find_program('../scripts/block-coroutine-wrapper.py')
 block_gen_c = custom_target('block-gen.c',
                             output: 'block-gen.c',
-                            input: files('../include/block/block.h',
-                                         'coroutines.h'),
+                            input: files(
+                                      '../include/block/block-io.h',
+                                      '../include/block/block-global-state.h',
+                                      'coroutines.h'
+                                      ),
                             command: [wrapper_py, '@OUTPUT@', '@INPUT@'])
 block_ss.add(block_gen_c)
 
diff --git a/include/block/block-common.h b/include/block/block-common.h
new file mode 100644
index 0000000000..4f1fd8de21
--- /dev/null
+++ b/include/block/block-common.h
@@ -0,0 +1,389 @@
+#ifndef BLOCK_COMMON_H
+#define BLOCK_COMMON_H
+
+#include "block/aio.h"
+#include "block/aio-wait.h"
+#include "qemu/iov.h"
+#include "qemu/coroutine.h"
+#include "block/accounting.h"
+#include "block/dirty-bitmap.h"
+#include "block/blockjob.h"
+#include "qemu/hbitmap.h"
+#include "qemu/transactions.h"
+
+/*
+ * generated_co_wrapper
+ *
+ * Function specifier, which does nothing but mark functions to be
+ * generated by scripts/block-coroutine-wrapper.py
+ *
+ * Read more in docs/devel/block-coroutine-wrapper.rst
+ */
+#define generated_co_wrapper
+
+#define BLKDBG_EVENT(child, evt) \
+    do { \
+        if (child) { \
+            bdrv_debug_event(child->bs, evt); \
+        } \
+    } while (0)
+
+/* block.c */
+typedef struct BlockDriver BlockDriver;
+typedef struct BdrvChild BdrvChild;
+typedef struct BdrvChildClass BdrvChildClass;
+
+typedef struct BlockDriverInfo {
+    /* in bytes, 0 if irrelevant */
+    int cluster_size;
+    /* offset at which the VM state can be saved (0 if not possible) */
+    int64_t vm_state_offset;
+    bool is_dirty;
+    /*
+     * True if this block driver only supports compressed writes
+     */
+    bool needs_compressed_writes;
+} BlockDriverInfo;
+
+typedef struct BlockFragInfo {
+    uint64_t allocated_clusters;
+    uint64_t total_clusters;
+    uint64_t fragmented_clusters;
+    uint64_t compressed_clusters;
+} BlockFragInfo;
+
+typedef enum {
+    BDRV_REQ_COPY_ON_READ       = 0x1,
+    BDRV_REQ_ZERO_WRITE         = 0x2,
+
+    /*
+     * The BDRV_REQ_MAY_UNMAP flag is used in write_zeroes requests to indicate
+     * that the block driver should unmap (discard) blocks if it is guaranteed
+     * that the result will read back as zeroes. The flag is only passed to the
+     * driver if the block device is opened with BDRV_O_UNMAP.
+     */
+    BDRV_REQ_MAY_UNMAP          = 0x4,
+
+    BDRV_REQ_FUA                = 0x10,
+    BDRV_REQ_WRITE_COMPRESSED   = 0x20,
+
+    /*
+     * Signifies that this write request will not change the visible disk
+     * content.
+     */
+    BDRV_REQ_WRITE_UNCHANGED    = 0x40,
+
+    /*
+     * Forces request serialisation. Use only with write requests.
+     */
+    BDRV_REQ_SERIALISING        = 0x80,
+
+    /*
+     * Execute the request only if the operation can be offloaded or otherwise
+     * be executed efficiently, but return an error instead of using a slow
+     * fallback.
+     */
+    BDRV_REQ_NO_FALLBACK        = 0x100,
+
+    /*
+     * BDRV_REQ_PREFETCH makes sense only in the context of copy-on-read
+     * (i.e., together with the BDRV_REQ_COPY_ON_READ flag or when a COR
+     * filter is involved), in which case it signals that the COR operation
+     * need not read the data into memory (qiov) but only ensure they are
+     * copied to the top layer (i.e., that COR operation is done).
+     */
+    BDRV_REQ_PREFETCH  = 0x200,
+
+    /*
+     * If we need to wait for other requests, just fail immediately. Used
+     * only together with BDRV_REQ_SERIALISING.
+     */
+    BDRV_REQ_NO_WAIT = 0x400,
+
+    /* Mask of valid flags */
+    BDRV_REQ_MASK               = 0x7ff,
+} BdrvRequestFlags;
+
+#define BDRV_O_NO_SHARE    0x0001 /* don't share permissions */
+#define BDRV_O_RDWR        0x0002
+#define BDRV_O_RESIZE      0x0004 /* request permission for resizing the node */
+#define BDRV_O_SNAPSHOT    0x0008 /* open the file read only and save
+                                     writes in a snapshot */
+#define BDRV_O_TEMPORARY   0x0010 /* delete the file after use */
+#define BDRV_O_NOCACHE     0x0020 /* do not use the host page cache */
+#define BDRV_O_NATIVE_AIO  0x0080 /* use native AIO instead of the
+                                     thread pool */
+#define BDRV_O_NO_BACKING  0x0100 /* don't open the backing file */
+#define BDRV_O_NO_FLUSH    0x0200 /* disable flushing on this disk */
+#define BDRV_O_COPY_ON_READ 0x0400 /* copy read backing sectors into image */
+#define BDRV_O_INACTIVE    0x0800  /* consistency hint for migration handoff */
+#define BDRV_O_CHECK       0x1000  /* open solely for consistency check */
+#define BDRV_O_ALLOW_RDWR  0x2000  /* allow reopen to change from r/o to r/w */
+#define BDRV_O_UNMAP       0x4000  /* execute guest UNMAP/TRIM operations */
+#define BDRV_O_PROTOCOL    0x8000  /* if no block driver is explicitly given:
+                                      select an appropriate protocol driver,
+                                      ignoring the format layer */
+#define BDRV_O_NO_IO       0x10000 /* don't initialize for I/O */
+#define BDRV_O_AUTO_RDONLY 0x20000 /* degrade to read-only if opening
+                                      read-write fails */
+#define BDRV_O_IO_URING    0x40000 /* use io_uring instead of the thread pool */
+
+#define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_NO_FLUSH)
+
+
+/* Option names of options parsed by the block layer */
+
+#define BDRV_OPT_CACHE_WB       "cache.writeback"
+#define BDRV_OPT_CACHE_DIRECT   "cache.direct"
+#define BDRV_OPT_CACHE_NO_FLUSH "cache.no-flush"
+#define BDRV_OPT_READ_ONLY      "read-only"
+#define BDRV_OPT_AUTO_READ_ONLY "auto-read-only"
+#define BDRV_OPT_DISCARD        "discard"
+#define BDRV_OPT_FORCE_SHARE    "force-share"
+
+
+#define BDRV_SECTOR_BITS   9
+#define BDRV_SECTOR_SIZE   (1ULL << BDRV_SECTOR_BITS)
+
+#define BDRV_REQUEST_MAX_SECTORS MIN_CONST(SIZE_MAX >> BDRV_SECTOR_BITS, \
+                                           INT_MAX >> BDRV_SECTOR_BITS)
+#define BDRV_REQUEST_MAX_BYTES (BDRV_REQUEST_MAX_SECTORS << BDRV_SECTOR_BITS)
+
+/*
+ * We want allow aligning requests and disk length up to any 32bit alignment
+ * and don't afraid of overflow.
+ * To achieve it, and in the same time use some pretty number as maximum disk
+ * size, let's define maximum "length" (a limit for any offset/bytes request and
+ * for disk size) to be the greatest power of 2 less than INT64_MAX.
+ */
+#define BDRV_MAX_ALIGNMENT (1L << 30)
+#define BDRV_MAX_LENGTH (QEMU_ALIGN_DOWN(INT64_MAX, BDRV_MAX_ALIGNMENT))
+
+/*
+ * Allocation status flags for bdrv_block_status() and friends.
+ *
+ * Public flags:
+ * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
+ * BDRV_BLOCK_ZERO: offset reads as zero
+ * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing raw data
+ * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
+ *                       layer rather than any backing, set by block layer
+ * BDRV_BLOCK_EOF: the returned pnum covers through end of file for this
+ *                 layer, set by block layer
+ *
+ * Internal flags:
+ * BDRV_BLOCK_RAW: for use by passthrough drivers, such as raw, to request
+ *                 that the block layer recompute the answer from the returned
+ *                 BDS; must be accompanied by just BDRV_BLOCK_OFFSET_VALID.
+ * BDRV_BLOCK_RECURSE: request that the block layer will recursively search for
+ *                     zeroes in file child of current block node inside
+ *                     returned region. Only valid together with both
+ *                     BDRV_BLOCK_DATA and BDRV_BLOCK_OFFSET_VALID. Should not
+ *                     appear with BDRV_BLOCK_ZERO.
+ *
+ * If BDRV_BLOCK_OFFSET_VALID is set, the map parameter represents the
+ * host offset within the returned BDS that is allocated for the
+ * corresponding raw guest data.  However, whether that offset
+ * actually contains data also depends on BDRV_BLOCK_DATA, as follows:
+ *
+ * DATA ZERO OFFSET_VALID
+ *  t    t        t       sectors read as zero, returned file is zero at offset
+ *  t    f        t       sectors read as valid from file at offset
+ *  f    t        t       sectors preallocated, read as zero, returned file not
+ *                        necessarily zero at offset
+ *  f    f        t       sectors preallocated but read from backing_hd,
+ *                        returned file contains garbage at offset
+ *  t    t        f       sectors preallocated, read as zero, unknown offset
+ *  t    f        f       sectors read from unknown file or offset
+ *  f    t        f       not allocated or unknown offset, read as zero
+ *  f    f        f       not allocated or unknown offset, read from backing_hd
+ */
+#define BDRV_BLOCK_DATA         0x01
+#define BDRV_BLOCK_ZERO         0x02
+#define BDRV_BLOCK_OFFSET_VALID 0x04
+#define BDRV_BLOCK_RAW          0x08
+#define BDRV_BLOCK_ALLOCATED    0x10
+#define BDRV_BLOCK_EOF          0x20
+#define BDRV_BLOCK_RECURSE      0x40
+
+typedef QTAILQ_HEAD(BlockReopenQueue, BlockReopenQueueEntry) BlockReopenQueue;
+
+typedef struct BDRVReopenState {
+    BlockDriverState *bs;
+    int flags;
+    BlockdevDetectZeroesOptions detect_zeroes;
+    bool backing_missing;
+    BlockDriverState *old_backing_bs; /* keep pointer for permissions update */
+    BlockDriverState *old_file_bs; /* keep pointer for permissions update */
+    QDict *options;
+    QDict *explicit_options;
+    void *opaque;
+} BDRVReopenState;
+
+/*
+ * Block operation types
+ */
+typedef enum BlockOpType {
+    BLOCK_OP_TYPE_BACKUP_SOURCE,
+    BLOCK_OP_TYPE_BACKUP_TARGET,
+    BLOCK_OP_TYPE_CHANGE,
+    BLOCK_OP_TYPE_COMMIT_SOURCE,
+    BLOCK_OP_TYPE_COMMIT_TARGET,
+    BLOCK_OP_TYPE_DATAPLANE,
+    BLOCK_OP_TYPE_DRIVE_DEL,
+    BLOCK_OP_TYPE_EJECT,
+    BLOCK_OP_TYPE_EXTERNAL_SNAPSHOT,
+    BLOCK_OP_TYPE_INTERNAL_SNAPSHOT,
+    BLOCK_OP_TYPE_INTERNAL_SNAPSHOT_DELETE,
+    BLOCK_OP_TYPE_MIRROR_SOURCE,
+    BLOCK_OP_TYPE_MIRROR_TARGET,
+    BLOCK_OP_TYPE_RESIZE,
+    BLOCK_OP_TYPE_STREAM,
+    BLOCK_OP_TYPE_REPLACE,
+    BLOCK_OP_TYPE_MAX,
+} BlockOpType;
+
+/* Block node permission constants */
+enum {
+    /**
+     * A user that has the "permission" of consistent reads is guaranteed that
+     * their view of the contents of the block device is complete and
+     * self-consistent, representing the contents of a disk at a specific
+     * point.
+     *
+     * For most block devices (including their backing files) this is true, but
+     * the property cannot be maintained in a few situations like for
+     * intermediate nodes of a commit block job.
+     */
+    BLK_PERM_CONSISTENT_READ    = 0x01,
+
+    /** This permission is required to change the visible disk contents. */
+    BLK_PERM_WRITE              = 0x02,
+
+    /**
+     * This permission (which is weaker than BLK_PERM_WRITE) is both enough and
+     * required for writes to the block node when the caller promises that
+     * the visible disk content doesn't change.
+     *
+     * As the BLK_PERM_WRITE permission is strictly stronger, either is
+     * sufficient to perform an unchanging write.
+     */
+    BLK_PERM_WRITE_UNCHANGED    = 0x04,
+
+    /** This permission is required to change the size of a block node. */
+    BLK_PERM_RESIZE             = 0x08,
+
+    /**
+     * This permission is required to change the node that this BdrvChild
+     * points to.
+     */
+    BLK_PERM_GRAPH_MOD          = 0x10,
+
+    BLK_PERM_ALL                = 0x1f,
+
+    DEFAULT_PERM_PASSTHROUGH    = BLK_PERM_CONSISTENT_READ
+                                 | BLK_PERM_WRITE
+                                 | BLK_PERM_WRITE_UNCHANGED
+                                 | BLK_PERM_RESIZE,
+
+    DEFAULT_PERM_UNCHANGED      = BLK_PERM_ALL & ~DEFAULT_PERM_PASSTHROUGH,
+};
+
+/*
+ * Flags that parent nodes assign to child nodes to specify what kind of
+ * role(s) they take.
+ *
+ * At least one of DATA, METADATA, FILTERED, or COW must be set for
+ * every child.
+ */
+enum BdrvChildRoleBits {
+    /*
+     * This child stores data.
+     * Any node may have an arbitrary number of such children.
+     */
+    BDRV_CHILD_DATA         = (1 << 0),
+
+    /*
+     * This child stores metadata.
+     * Any node may have an arbitrary number of metadata-storing
+     * children.
+     */
+    BDRV_CHILD_METADATA     = (1 << 1),
+
+    /*
+     * A child that always presents exactly the same visible data as
+     * the parent, e.g. by virtue of the parent forwarding all reads
+     * and writes.
+     * This flag is mutually exclusive with DATA, METADATA, and COW.
+     * Any node may have at most one filtered child at a time.
+     */
+    BDRV_CHILD_FILTERED     = (1 << 2),
+
+    /*
+     * Child from which to read all data that isn't allocated in the
+     * parent (i.e., the backing child); such data is copied to the
+     * parent through COW (and optionally COR).
+     * This field is mutually exclusive with DATA, METADATA, and
+     * FILTERED.
+     * Any node may have at most one such backing child at a time.
+     */
+    BDRV_CHILD_COW          = (1 << 3),
+
+    /*
+     * The primary child.  For most drivers, this is the child whose
+     * filename applies best to the parent node.
+     * Any node may have at most one primary child at a time.
+     */
+    BDRV_CHILD_PRIMARY      = (1 << 4),
+
+    /* Useful combination of flags */
+    BDRV_CHILD_IMAGE        = BDRV_CHILD_DATA
+                              | BDRV_CHILD_METADATA
+                              | BDRV_CHILD_PRIMARY,
+};
+
+/* Mask of BdrvChildRoleBits values */
+typedef unsigned int BdrvChildRole;
+
+typedef struct BdrvCheckResult {
+    int corruptions;
+    int leaks;
+    int check_errors;
+    int corruptions_fixed;
+    int leaks_fixed;
+    int64_t image_end_offset;
+    BlockFragInfo bfi;
+} BdrvCheckResult;
+
+typedef enum {
+    BDRV_FIX_LEAKS    = 1,
+    BDRV_FIX_ERRORS   = 2,
+} BdrvCheckMode;
+
+typedef struct BlockSizes {
+    uint32_t phys;
+    uint32_t log;
+} BlockSizes;
+
+typedef struct HDGeometry {
+    uint32_t heads;
+    uint32_t sectors;
+    uint32_t cylinders;
+} HDGeometry;
+
+/* Common functions that are neither I/O nor Global State */
+uint64_t bdrv_qapi_perm_to_blk_perm(BlockPermission qapi_perm);
+char *bdrv_perm_names(uint64_t perm);
+
+int bdrv_parse_aio(const char *mode, int *flags);
+
+int path_has_protocol(const char *path);
+int path_is_absolute(const char *path);
+char *path_combine(const char *base_path, const char *filename);
+
+char *bdrv_get_full_backing_filename_from_filename(const char *backed,
+                                                   const char *backing,
+                                                   Error **errp);
+char *bdrv_get_full_backing_filename(BlockDriverState *bs, Error **errp);
+
+#endif /* BLOCK_COMMON_H */
diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
new file mode 100644
index 0000000000..4e7775ca96
--- /dev/null
+++ b/include/block/block-global-state.h
@@ -0,0 +1,286 @@
+/*
+ * QEMU System Emulator block driver
+ *
+ * Copyright (c) 2003 Fabrice Bellard
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#ifndef BLOCK_GLOBAL_STATE_H
+#define BLOCK_GLOBAL_STATE_H
+
+#include "block-common.h"
+
+/*
+ * Global state (GS) API. These functions run under the BQL.
+ *
+ * If a function modifies the graph, it also uses drain and/or
+ * aio_context_acquire/release to be sure it has unique access.
+ * aio_context locking is needed together with BQL because of
+ * the thread-safe I/O API that concurrently runs and accesses
+ * the graph without the BQL.
+ *
+ * It is important to note that not all of these functions are
+ * necessarily limited to running under the BQL, but they would
+ * require additional auditing and many small thread-safety changes
+ * to move them into the I/O API. Often it's not worth doing that
+ * work since the APIs are only used with the BQL held at the
+ * moment, so they have been placed in the GS API (for now).
+ *
+ * All functions in this header must use this assertion:
+ * assert(qemu_in_main_thread());
+ * to catch when they are accidentally called without the BQL.
+ */
+
+/* disk I/O throttling */
+void bdrv_init(void);
+void bdrv_init_with_whitelist(void);
+bool bdrv_uses_whitelist(void);
+int bdrv_is_whitelisted(BlockDriver *drv, bool read_only);
+BlockDriver *bdrv_find_protocol(const char *filename,
+                                bool allow_protocol_prefix,
+                                Error **errp);
+BlockDriver *bdrv_find_format(const char *format_name);
+int bdrv_create(BlockDriver *drv, const char* filename,
+                QemuOpts *opts, Error **errp);
+int bdrv_create_file(const char *filename, QemuOpts *opts, Error **errp);
+
+BlockDriverState *bdrv_new(void);
+int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
+                Error **errp);
+int bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,
+                      Error **errp);
+BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *node_options,
+                                   int flags, Error **errp);
+int bdrv_drop_filter(BlockDriverState *bs, Error **errp);
+
+int bdrv_parse_cache_mode(const char *mode, int *flags, bool *writethrough);
+int bdrv_parse_discard_flags(const char *mode, int *flags);
+BdrvChild *bdrv_open_child(const char *filename,
+                           QDict *options, const char *bdref_key,
+                           BlockDriverState *parent,
+                           const BdrvChildClass *child_class,
+                           BdrvChildRole child_role,
+                           bool allow_none, Error **errp);
+BlockDriverState *bdrv_open_blockdev_ref(BlockdevRef *ref, Error **errp);
+int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
+                        Error **errp);
+int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options,
+                           const char *bdref_key, Error **errp);
+BlockDriverState *bdrv_open(const char *filename, const char *reference,
+                            QDict *options, int flags, Error **errp);
+BlockDriverState *bdrv_new_open_driver_opts(BlockDriver *drv,
+                                            const char *node_name,
+                                            QDict *options, int flags,
+                                            Error **errp);
+BlockDriverState *bdrv_new_open_driver(BlockDriver *drv, const char *node_name,
+                                       int flags, Error **errp);
+BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
+                                    BlockDriverState *bs, QDict *options,
+                                    bool keep_old_opts);
+void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue);
+int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp);
+int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
+                Error **errp);
+int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
+                              Error **errp);
+int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags);
+BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
+                                          const char *backing_file);
+void bdrv_refresh_filename(BlockDriverState *bs);
+void bdrv_refresh_limits(BlockDriverState *bs, Transaction *tran, Error **errp);
+int bdrv_commit(BlockDriverState *bs);
+int bdrv_make_empty(BdrvChild *c, Error **errp);
+int bdrv_change_backing_file(BlockDriverState *bs, const char *backing_file,
+                             const char *backing_fmt, bool warn);
+void bdrv_register(BlockDriver *bdrv);
+int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
+                           const char *backing_file_str);
+BlockDriverState *bdrv_find_overlay(BlockDriverState *active,
+                                    BlockDriverState *bs);
+BlockDriverState *bdrv_find_base(BlockDriverState *bs);
+bool bdrv_is_backing_chain_frozen(BlockDriverState *bs, BlockDriverState *base,
+                                  Error **errp);
+int bdrv_freeze_backing_chain(BlockDriverState *bs, BlockDriverState *base,
+                              Error **errp);
+void bdrv_unfreeze_backing_chain(BlockDriverState *bs, BlockDriverState *base);
+
+/*
+ * The units of offset and total_work_size may be chosen arbitrarily by the
+ * block driver; total_work_size may change during the course of the amendment
+ * operation
+ */
+typedef void BlockDriverAmendStatusCB(BlockDriverState *bs, int64_t offset,
+                                      int64_t total_work_size, void *opaque);
+int bdrv_amend_options(BlockDriverState *bs_new, QemuOpts *opts,
+                       BlockDriverAmendStatusCB *status_cb, void *cb_opaque,
+                       bool force,
+                       Error **errp);
+
+/* check if a named node can be replaced when doing drive-mirror */
+BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
+                                        const char *node_name, Error **errp);
+
+/* async block I/O */
+void bdrv_aio_cancel(BlockAIOCB *acb);
+void bdrv_aio_cancel_async(BlockAIOCB *acb);
+void bdrv_invalidate_cache_all(Error **errp);
+int bdrv_inactivate_all(void);
+
+int bdrv_flush_all(void);
+void bdrv_close_all(void);
+void bdrv_drain(BlockDriverState *bs);
+void coroutine_fn bdrv_co_drain(BlockDriverState *bs);
+void bdrv_drain_all_begin(void);
+void bdrv_drain_all_end(void);
+void bdrv_drain_all(void);
+
+#define BDRV_POLL_WHILE(bs, cond) ({                       \
+    BlockDriverState *bs_ = (bs);                          \
+    AIO_WAIT_WHILE(bdrv_get_aio_context(bs_),              \
+                   cond); })
+
+int bdrv_has_zero_init_1(BlockDriverState *bs);
+int bdrv_has_zero_init(BlockDriverState *bs);
+void bdrv_lock_medium(BlockDriverState *bs, bool locked);
+void bdrv_eject(BlockDriverState *bs, bool eject_flag);
+BlockDriverState *bdrv_find_node(const char *node_name);
+BlockDeviceInfoList *bdrv_named_nodes_list(bool flat, Error **errp);
+XDbgBlockGraph *bdrv_get_xdbg_block_graph(Error **errp);
+BlockDriverState *bdrv_lookup_bs(const char *device,
+                                 const char *node_name,
+                                 Error **errp);
+bool bdrv_chain_contains(BlockDriverState *top, BlockDriverState *base);
+BlockDriverState *bdrv_next_node(BlockDriverState *bs);
+BlockDriverState *bdrv_next_all_states(BlockDriverState *bs);
+
+typedef struct BdrvNextIterator {
+    enum {
+        BDRV_NEXT_BACKEND_ROOTS,
+        BDRV_NEXT_MONITOR_OWNED,
+    } phase;
+    BlockBackend *blk;
+    BlockDriverState *bs;
+} BdrvNextIterator;
+
+BlockDriverState *bdrv_first(BdrvNextIterator *it);
+BlockDriverState *bdrv_next(BdrvNextIterator *it);
+void bdrv_next_cleanup(BdrvNextIterator *it);
+
+BlockDriverState *bdrv_next_monitor_owned(BlockDriverState *bs);
+void bdrv_iterate_format(void (*it)(void *opaque, const char *name),
+                         void *opaque, bool read_only);
+const char *bdrv_get_device_name(const BlockDriverState *bs);
+const char *bdrv_get_device_or_node_name(const BlockDriverState *bs);
+int bdrv_get_flags(BlockDriverState *bs);
+char *bdrv_dirname(BlockDriverState *bs, Error **errp);
+
+int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
+                      int64_t pos, int size);
+
+int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
+                      int64_t pos, int size);
+
+void bdrv_img_create(const char *filename, const char *fmt,
+                     const char *base_filename, const char *base_fmt,
+                     char *options, uint64_t img_size, int flags,
+                     bool quiet, Error **errp);
+
+void bdrv_ref(BlockDriverState *bs);
+void bdrv_unref(BlockDriverState *bs);
+void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
+BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
+                             BlockDriverState *child_bs,
+                             const char *child_name,
+                             const BdrvChildClass *child_class,
+                             BdrvChildRole child_role,
+                             Error **errp);
+
+bool bdrv_op_is_blocked(BlockDriverState *bs, BlockOpType op, Error **errp);
+void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason);
+void bdrv_op_unblock(BlockDriverState *bs, BlockOpType op, Error *reason);
+void bdrv_op_block_all(BlockDriverState *bs, Error *reason);
+void bdrv_op_unblock_all(BlockDriverState *bs, Error *reason);
+bool bdrv_op_blocker_is_empty(BlockDriverState *bs);
+
+int bdrv_debug_breakpoint(BlockDriverState *bs, const char *event,
+                           const char *tag);
+int bdrv_debug_remove_breakpoint(BlockDriverState *bs, const char *tag);
+int bdrv_debug_resume(BlockDriverState *bs, const char *tag);
+bool bdrv_debug_is_suspended(BlockDriverState *bs, const char *tag);
+
+/**
+ * Locks the AioContext of @bs if it's not the current AioContext. This avoids
+ * double locking which could lead to deadlocks: This is a coroutine_fn, so we
+ * know we already own the lock of the current AioContext.
+ *
+ * May only be called in the main thread.
+ */
+void coroutine_fn bdrv_co_lock(BlockDriverState *bs);
+
+/**
+ * Unlocks the AioContext of @bs if it's not the current AioContext.
+ */
+void coroutine_fn bdrv_co_unlock(BlockDriverState *bs);
+
+void bdrv_set_aio_context_ignore(BlockDriverState *bs,
+                                 AioContext *new_context, GSList **ignore);
+int bdrv_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
+                             Error **errp);
+int bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
+                                   BdrvChild *ignore_child, Error **errp);
+bool bdrv_child_can_set_aio_context(BdrvChild *c, AioContext *ctx,
+                                    GSList **ignore, Error **errp);
+bool bdrv_can_set_aio_context(BlockDriverState *bs, AioContext *ctx,
+                              GSList **ignore, Error **errp);
+
+int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz);
+int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry *geo);
+
+/**
+ * bdrv_drained_end_no_poll:
+ *
+ * Same as bdrv_drained_end(), but do not poll for the subgraph to
+ * actually become unquiesced.  Therefore, no graph changes will occur
+ * with this function.
+ *
+ * *drained_end_counter is incremented for every background operation
+ * that is scheduled, and will be decremented for every operation once
+ * it settles.  The caller must poll until it reaches 0.  The counter
+ * should be accessed using atomic operations only.
+ */
+void bdrv_drained_end_no_poll(BlockDriverState *bs, int *drained_end_counter);
+
+void bdrv_add_child(BlockDriverState *parent, BlockDriverState *child,
+                    Error **errp);
+void bdrv_del_child(BlockDriverState *parent, BdrvChild *child, Error **errp);
+
+/**
+ *
+ * bdrv_register_buf/bdrv_unregister_buf:
+ *
+ * Register/unregister a buffer for I/O. For example, VFIO drivers are
+ * interested to know the memory areas that would later be used for I/O, so
+ * that they can prepare IOMMU mapping etc., to get better performance.
+ */
+void bdrv_register_buf(BlockDriverState *bs, void *host, size_t size);
+void bdrv_unregister_buf(BlockDriverState *bs, void *host);
+
+void bdrv_cancel_in_flight(BlockDriverState *bs);
+
+#endif /* BLOCK_GLOBAL_STATE_H */
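
The header comment above asks every GS function to start with assert(qemu_in_main_thread()). Below is a minimal sketch, outside the patch itself, of that pattern. qemu_in_main_thread() is replaced by a hypothetical thread-local stand-in (demo_in_main_thread()), and demo_gs_node_ref() is an invented example function, not a QEMU API.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for qemu_in_main_thread(): a thread-local flag
 * that only the main thread sets. Other threads keep the default (false).
 */
static _Thread_local bool demo_is_main_thread;

static void demo_mark_main_thread(void)
{
    demo_is_main_thread = true;
}

static bool demo_in_main_thread(void)
{
    return demo_is_main_thread;
}

/* Global state touched only by GS-style functions. */
static int demo_node_refcnt;

/*
 * Hypothetical GS-style function: the assertion comes first, so a call
 * from a thread that is not the main thread aborts instead of racing
 * on the global state.
 */
static void demo_gs_node_ref(void)
{
    assert(demo_in_main_thread());
    demo_node_refcnt++;
}
```

In this sketch the main thread calls demo_mark_main_thread() once at startup; any later call to demo_gs_node_ref() from another thread trips the assertion.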
diff --git a/include/block/block-io.h b/include/block/block-io.h
new file mode 100644
index 0000000000..9af4609ccb
--- /dev/null
+++ b/include/block/block-io.h
@@ -0,0 +1,306 @@
+/*
+ * QEMU System Emulator block driver
+ *
+ * Copyright (c) 2003 Fabrice Bellard
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#ifndef BLOCK_IO_H
+#define BLOCK_IO_H
+
+#include "block-common.h"
+
+/*
+ * I/O API functions. These functions are thread-safe, and therefore
+ * can run in any thread as long as the thread holds the AioContext
+ * lock, i.e. runs between aio_context_acquire() and aio_context_release().
+ */
+
+int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
+                          Error **errp);
+
+int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
+                       int64_t bytes, BdrvRequestFlags flags);
+int bdrv_pread(BdrvChild *child, int64_t offset, void *buf, int64_t bytes);
+int bdrv_pwrite(BdrvChild *child, int64_t offset, const void *buf,
+                int64_t bytes);
+int bdrv_pwrite_sync(BdrvChild *child, int64_t offset,
+                     const void *buf, int64_t bytes);
+/*
+ * Efficiently zero a region of the disk image.  Note that this is a regular
+ * I/O request like read or write and should have a reasonable size.  This
+ * function is not suitable for zeroing the entire image in a single request
+ * because it may allocate memory for the entire region.
+ */
+int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
+                                       int64_t bytes, BdrvRequestFlags flags);
+
+int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
+                                  PreallocMode prealloc, BdrvRequestFlags flags,
+                                  Error **errp);
+int generated_co_wrapper
+bdrv_truncate(BdrvChild *child, int64_t offset, bool exact,
+              PreallocMode prealloc, BdrvRequestFlags flags, Error **errp);
+
+int64_t bdrv_nb_sectors(BlockDriverState *bs);
+int64_t bdrv_getlength(BlockDriverState *bs);
+int64_t bdrv_get_allocated_file_size(BlockDriverState *bs);
+BlockMeasureInfo *bdrv_measure(BlockDriver *drv, QemuOpts *opts,
+                               BlockDriverState *in_bs, Error **errp);
+void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr);
+int coroutine_fn bdrv_co_delete_file(BlockDriverState *bs, Error **errp);
+void coroutine_fn bdrv_co_delete_file_noerr(BlockDriverState *bs);
+
+
+int generated_co_wrapper bdrv_check(BlockDriverState *bs, BdrvCheckResult *res,
+                                    BdrvCheckMode fix);
+
+/* sg packet commands */
+int bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf);
+
+/* Invalidate any cached metadata used by image formats */
+int generated_co_wrapper bdrv_invalidate_cache(BlockDriverState *bs,
+                                               Error **errp);
+
+/* Ensure contents are flushed to disk.  */
+int generated_co_wrapper bdrv_flush(BlockDriverState *bs);
+int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
+
+int generated_co_wrapper bdrv_pdiscard(BdrvChild *child, int64_t offset,
+                                       int64_t bytes);
+int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
+bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
+int bdrv_block_status(BlockDriverState *bs, int64_t offset,
+                      int64_t bytes, int64_t *pnum, int64_t *map,
+                      BlockDriverState **file);
+int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
+                            int64_t offset, int64_t bytes, int64_t *pnum,
+                            int64_t *map, BlockDriverState **file);
+int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
+                      int64_t *pnum);
+int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
+                            bool include_base, int64_t offset, int64_t bytes,
+                            int64_t *pnum);
+int coroutine_fn bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset,
+                                      int64_t bytes);
+
+bool bdrv_is_read_only(BlockDriverState *bs);
+int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only,
+                           bool ignore_allow_rdw, Error **errp);
+int bdrv_apply_auto_read_only(BlockDriverState *bs, const char *errmsg,
+                              Error **errp);
+bool bdrv_is_writable(BlockDriverState *bs);
+bool bdrv_is_sg(BlockDriverState *bs);
+bool bdrv_is_inserted(BlockDriverState *bs);
+const char *bdrv_get_format_name(BlockDriverState *bs);
+
+bool bdrv_supports_compressed_writes(BlockDriverState *bs);
+const char *bdrv_get_node_name(const BlockDriverState *bs);
+int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);
+ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs,
+                                          Error **errp);
+BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs);
+void bdrv_round_to_clusters(BlockDriverState *bs,
+                            int64_t offset, int64_t bytes,
+                            int64_t *cluster_offset,
+                            int64_t *cluster_bytes);
+
+void bdrv_get_backing_filename(BlockDriverState *bs,
+                               char *filename, int filename_size);
+
+int generated_co_wrapper
+bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
+int generated_co_wrapper
+bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
+
+/*
+ * Returns the alignment in bytes that is required so that no bounce buffer
+ * is required throughout the stack
+ */
+size_t bdrv_min_mem_align(BlockDriverState *bs);
+/* Returns optimal alignment in bytes for bounce buffer */
+size_t bdrv_opt_mem_align(BlockDriverState *bs);
+void *qemu_blockalign(BlockDriverState *bs, size_t size);
+void *qemu_blockalign0(BlockDriverState *bs, size_t size);
+void *qemu_try_blockalign(BlockDriverState *bs, size_t size);
+void *qemu_try_blockalign0(BlockDriverState *bs, size_t size);
+bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov);
+
+void bdrv_enable_copy_on_read(BlockDriverState *bs);
+void bdrv_disable_copy_on_read(BlockDriverState *bs);
+
+void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event);
+
+#define BLKDBG_EVENT(child, evt) \
+    do { \
+        if (child) { \
+            bdrv_debug_event(child->bs, evt); \
+        } \
+    } while (0)
+
+/**
+ * bdrv_get_aio_context:
+ *
+ * Returns: the currently bound #AioContext
+ */
+AioContext *bdrv_get_aio_context(BlockDriverState *bs);
+/**
+ * Move the current coroutine to the AioContext of @bs and return the old
+ * AioContext of the coroutine. Increase bs->in_flight so that draining @bs
+ * will wait for the operation to proceed until the corresponding
+ * bdrv_co_leave().
+ *
+ * Consequently, you can't call drain inside a bdrv_co_enter/leave() section as
+ * this will deadlock.
+ */
+AioContext *coroutine_fn bdrv_co_enter(BlockDriverState *bs);
+
+/**
+ * Ends a section started by bdrv_co_enter(). Move the current coroutine back
+ * to old_ctx and decrease bs->in_flight again.
+ */
+void coroutine_fn bdrv_co_leave(BlockDriverState *bs, AioContext *old_ctx);
+
+/**
+ * Transfer control to @co in the aio context of @bs
+ */
+void bdrv_coroutine_enter(BlockDriverState *bs, Coroutine *co);
+
+AioContext *bdrv_child_get_parent_aio_context(BdrvChild *c);
+AioContext *child_of_bds_get_parent_aio_context(BdrvChild *c);
+
+void bdrv_io_plug(BlockDriverState *bs);
+void bdrv_io_unplug(BlockDriverState *bs);
+
+/**
+ * bdrv_parent_drained_begin_single:
+ *
+ * Begin a quiesced section for the parent of @c. If @poll is true, wait for
+ * any pending activity to cease.
+ */
+void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll);
+
+/**
+ * bdrv_parent_drained_end_single:
+ *
+ * End a quiesced section for the parent of @c.
+ *
+ * This polls @bs's AioContext until all scheduled sub-drained_ends
+ * have settled, which may result in graph changes.
+ */
+void bdrv_parent_drained_end_single(BdrvChild *c);
+
+/**
+ * bdrv_drain_poll:
+ *
+ * Poll for pending requests in @bs, its parents (except for @ignore_parent),
+ * and if @recursive is true its children as well (used for subtree drain).
+ *
+ * If @ignore_bds_parents is true, parents that are BlockDriverStates must
+ * ignore the drain request because they will be drained separately (used for
+ * drain_all).
+ *
+ * This is part of bdrv_drained_begin.
+ */
+bool bdrv_drain_poll(BlockDriverState *bs, bool recursive,
+                     BdrvChild *ignore_parent, bool ignore_bds_parents);
+
+/**
+ * bdrv_drained_begin:
+ *
+ * Begin a quiesced section for exclusive access to the BDS, by disabling
+ * external request sources including NBD server and device model. Note that
+ * this doesn't block timers or coroutines from submitting more requests, which
+ * means block_job_pause is still necessary.
+ *
+ * This function can be recursive.
+ */
+void bdrv_drained_begin(BlockDriverState *bs);
+
+/**
+ * bdrv_do_drained_begin_quiesce:
+ *
+ * Quiesces a BDS like bdrv_drained_begin(), but does not wait for already
+ * running requests to complete.
+ */
+void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
+                                   BdrvChild *parent, bool ignore_bds_parents);
+
+/**
+ * Like bdrv_drained_begin, but recursively begins a quiesced section for
+ * exclusive access to all child nodes as well.
+ */
+void bdrv_subtree_drained_begin(BlockDriverState *bs);
+
+/**
+ * bdrv_drained_end:
+ *
+ * End a quiescent section started by bdrv_drained_begin().
+ *
+ * This polls @bs's AioContext until all scheduled sub-drained_ends
+ * have settled.  On one hand, that may result in graph changes.  On
+ * the other, this requires that the caller either runs in the main
+ * loop; or that all involved nodes (@bs and all of its parents) are
+ * in the caller's AioContext.
+ */
+void bdrv_drained_end(BlockDriverState *bs);
+
+/**
+ * End a quiescent section started by bdrv_subtree_drained_begin().
+ */
+void bdrv_subtree_drained_end(BlockDriverState *bs);
+
+bool bdrv_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
+                                     uint32_t granularity, Error **errp);
+
+/**
+ *
+ * bdrv_co_copy_range:
+ *
+ * Do offloaded copy between two children. If the operation is not implemented
+ * by the driver, or if the backend storage doesn't support it, a negative
+ * error code will be returned.
+ *
+ * Note: block layer doesn't emulate or fall back to a bounce buffer approach
+ * because usually the caller shouldn't attempt offloaded copy any more (e.g.
+ * calling copy_file_range(2)) after the first error, thus it should fall back
+ * to a read+write path in the caller level.
+ *
+ * @src: Source child to copy data from
+ * @src_offset: offset in @src image to read data
+ * @dst: Destination child to copy data to
+ * @dst_offset: offset in @dst image to write data
+ * @bytes: number of bytes to copy
+ * @read_flags, @write_flags: request flags. Supported flags:
+ *         BDRV_REQ_ZERO_WRITE - treat the @src range as zero data and do zero
+ *                               write on @dst as if bdrv_co_pwrite_zeroes is
+ *                               called. Used to simplify caller code, or
+ *                               during BlockDriver.bdrv_co_copy_range_from()
+ *                               recursion.
+ *         BDRV_REQ_NO_SERIALISING - do not serialize with other overlapping
+ *                                   requests currently in flight.
+ *
+ * Returns: 0 if succeeded; negative error code if failed.
+ **/
+int coroutine_fn bdrv_co_copy_range(BdrvChild *src, int64_t src_offset,
+                                    BdrvChild *dst, int64_t dst_offset,
+                                    int64_t bytes, BdrvRequestFlags read_flags,
+                                    BdrvRequestFlags write_flags);
+
+#endif /* BLOCK_IO_H */
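
The bdrv_co_copy_range() comment above leaves the fallback to the caller: after the first offload error, stop trying and use a read+write path. Below is a minimal sketch, outside the patch itself, of that caller-side policy. It works on plain memory buffers rather than BdrvChild, and DemoCopyState, demo_copy_range() and demo_copy_chunk() are hypothetical names; the offload stub always fails so that the fallback is exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-job state kept by the caller. */
typedef struct DemoCopyState {
    bool use_copy_range;    /* cleared after the first offload failure */
    int offload_attempts;   /* how often the offloaded path was tried */
    int fallback_copies;    /* how often the read+write path was used */
} DemoCopyState;

/* Hypothetical offloaded copy; here it is never supported. */
static int demo_copy_range(char *dst, const char *src, size_t bytes)
{
    (void)dst;
    (void)src;
    (void)bytes;
    return -ENOTSUP;
}

/*
 * Copy one chunk. Try the offloaded path only while it has not failed
 * yet; after the first error, always use the bounce-buffer style copy.
 */
static int demo_copy_chunk(DemoCopyState *s, char *dst, const char *src,
                           size_t bytes)
{
    if (s->use_copy_range) {
        s->offload_attempts++;
        if (demo_copy_range(dst, src, bytes) == 0) {
            return 0;
        }
        /* First error: do not attempt offloaded copy any more. */
        s->use_copy_range = false;
    }

    memcpy(dst, src, bytes);    /* stands in for read + write */
    s->fallback_copies++;
    return 0;
}
```

Keeping the use_copy_range flag in caller state means the failed offload attempt is paid for once per job, not once per chunk.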
diff --git a/include/block/block.h b/include/block/block.h
index e5dd22b034..1e6b8fef1e 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -1,864 +1,32 @@
-#ifndef BLOCK_H
-#define BLOCK_H
-
-#include "block/aio.h"
-#include "block/aio-wait.h"
-#include "qemu/iov.h"
-#include "qemu/coroutine.h"
-#include "block/accounting.h"
-#include "block/dirty-bitmap.h"
-#include "block/blockjob.h"
-#include "qemu/hbitmap.h"
-#include "qemu/transactions.h"
-
 /*
- * generated_co_wrapper
- *
- * Function specifier, which does nothing but mark functions to be
- * generated by scripts/block-coroutine-wrapper.py
- *
- * Read more in docs/devel/block-coroutine-wrapper.rst
- */
-#define generated_co_wrapper
-
-/* block.c */
-typedef struct BlockDriver BlockDriver;
-typedef struct BdrvChild BdrvChild;
-typedef struct BdrvChildClass BdrvChildClass;
-
-typedef struct BlockDriverInfo {
-    /* in bytes, 0 if irrelevant */
-    int cluster_size;
-    /* offset at which the VM state can be saved (0 if not possible) */
-    int64_t vm_state_offset;
-    bool is_dirty;
-    /*
-     * True if this block driver only supports compressed writes
-     */
-    bool needs_compressed_writes;
-} BlockDriverInfo;
-
-typedef struct BlockFragInfo {
-    uint64_t allocated_clusters;
-    uint64_t total_clusters;
-    uint64_t fragmented_clusters;
-    uint64_t compressed_clusters;
-} BlockFragInfo;
-
-typedef enum {
-    BDRV_REQ_COPY_ON_READ       = 0x1,
-    BDRV_REQ_ZERO_WRITE         = 0x2,
-
-    /*
-     * The BDRV_REQ_MAY_UNMAP flag is used in write_zeroes requests to indicate
-     * that the block driver should unmap (discard) blocks if it is guaranteed
-     * that the result will read back as zeroes. The flag is only passed to the
-     * driver if the block device is opened with BDRV_O_UNMAP.
-     */
-    BDRV_REQ_MAY_UNMAP          = 0x4,
-
-    BDRV_REQ_FUA                = 0x10,
-    BDRV_REQ_WRITE_COMPRESSED   = 0x20,
-
-    /* Signifies that this write request will not change the visible disk
-     * content. */
-    BDRV_REQ_WRITE_UNCHANGED    = 0x40,
-
-    /* Forces request serialisation. Use only with write requests. */
-    BDRV_REQ_SERIALISING        = 0x80,
-
-    /* Execute the request only if the operation can be offloaded or otherwise
-     * be executed efficiently, but return an error instead of using a slow
-     * fallback. */
-    BDRV_REQ_NO_FALLBACK        = 0x100,
-
-    /*
-     * BDRV_REQ_PREFETCH makes sense only in the context of copy-on-read
-     * (i.e., together with the BDRV_REQ_COPY_ON_READ flag or when a COR
-     * filter is involved), in which case it signals that the COR operation
-     * need not read the data into memory (qiov) but only ensure they are
-     * copied to the top layer (i.e., that COR operation is done).
-     */
-    BDRV_REQ_PREFETCH  = 0x200,
-
-    /*
-     * If we need to wait for other requests, just fail immediately. Used
-     * only together with BDRV_REQ_SERIALISING.
-     */
-    BDRV_REQ_NO_WAIT = 0x400,
-
-    /* Mask of valid flags */
-    BDRV_REQ_MASK               = 0x7ff,
-} BdrvRequestFlags;
-
-typedef struct BlockSizes {
-    uint32_t phys;
-    uint32_t log;
-} BlockSizes;
-
-typedef struct HDGeometry {
-    uint32_t heads;
-    uint32_t sectors;
-    uint32_t cylinders;
-} HDGeometry;
-
-#define BDRV_O_NO_SHARE    0x0001 /* don't share permissions */
-#define BDRV_O_RDWR        0x0002
-#define BDRV_O_RESIZE      0x0004 /* request permission for resizing the node */
-#define BDRV_O_SNAPSHOT    0x0008 /* open the file read only and save writes in a snapshot */
-#define BDRV_O_TEMPORARY   0x0010 /* delete the file after use */
-#define BDRV_O_NOCACHE     0x0020 /* do not use the host page cache */
-#define BDRV_O_NATIVE_AIO  0x0080 /* use native AIO instead of the thread pool */
-#define BDRV_O_NO_BACKING  0x0100 /* don't open the backing file */
-#define BDRV_O_NO_FLUSH    0x0200 /* disable flushing on this disk */
-#define BDRV_O_COPY_ON_READ 0x0400 /* copy read backing sectors into image */
-#define BDRV_O_INACTIVE    0x0800  /* consistency hint for migration handoff */
-#define BDRV_O_CHECK       0x1000  /* open solely for consistency check */
-#define BDRV_O_ALLOW_RDWR  0x2000  /* allow reopen to change from r/o to r/w */
-#define BDRV_O_UNMAP       0x4000  /* execute guest UNMAP/TRIM operations */
-#define BDRV_O_PROTOCOL    0x8000  /* if no block driver is explicitly given:
-                                      select an appropriate protocol driver,
-                                      ignoring the format layer */
-#define BDRV_O_NO_IO       0x10000 /* don't initialize for I/O */
-#define BDRV_O_AUTO_RDONLY 0x20000 /* degrade to read-only if opening read-write fails */
-#define BDRV_O_IO_URING    0x40000 /* use io_uring instead of the thread pool */
-
-#define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_NO_FLUSH)
-
-
-/* Option names of options parsed by the block layer */
-
-#define BDRV_OPT_CACHE_WB       "cache.writeback"
-#define BDRV_OPT_CACHE_DIRECT   "cache.direct"
-#define BDRV_OPT_CACHE_NO_FLUSH "cache.no-flush"
-#define BDRV_OPT_READ_ONLY      "read-only"
-#define BDRV_OPT_AUTO_READ_ONLY "auto-read-only"
-#define BDRV_OPT_DISCARD        "discard"
-#define BDRV_OPT_FORCE_SHARE    "force-share"
-
-
-#define BDRV_SECTOR_BITS   9
-#define BDRV_SECTOR_SIZE   (1ULL << BDRV_SECTOR_BITS)
-
-#define BDRV_REQUEST_MAX_SECTORS MIN_CONST(SIZE_MAX >> BDRV_SECTOR_BITS, \
-                                           INT_MAX >> BDRV_SECTOR_BITS)
-#define BDRV_REQUEST_MAX_BYTES (BDRV_REQUEST_MAX_SECTORS << BDRV_SECTOR_BITS)
-
-/*
- * We want allow aligning requests and disk length up to any 32bit alignment
- * and don't afraid of overflow.
- * To achieve it, and in the same time use some pretty number as maximum disk
- * size, let's define maximum "length" (a limit for any offset/bytes request and
- * for disk size) to be the greatest power of 2 less than INT64_MAX.
- */
-#define BDRV_MAX_ALIGNMENT (1L << 30)
-#define BDRV_MAX_LENGTH (QEMU_ALIGN_DOWN(INT64_MAX, BDRV_MAX_ALIGNMENT))
-
-/*
- * Allocation status flags for bdrv_block_status() and friends.
- *
- * Public flags:
- * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
- * BDRV_BLOCK_ZERO: offset reads as zero
- * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing raw data
- * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
- *                       layer rather than any backing, set by block layer
- * BDRV_BLOCK_EOF: the returned pnum covers through end of file for this
- *                 layer, set by block layer
- *
- * Internal flags:
- * BDRV_BLOCK_RAW: for use by passthrough drivers, such as raw, to request
- *                 that the block layer recompute the answer from the returned
- *                 BDS; must be accompanied by just BDRV_BLOCK_OFFSET_VALID.
- * BDRV_BLOCK_RECURSE: request that the block layer will recursively search for
- *                     zeroes in file child of current block node inside
- *                     returned region. Only valid together with both
- *                     BDRV_BLOCK_DATA and BDRV_BLOCK_OFFSET_VALID. Should not
- *                     appear with BDRV_BLOCK_ZERO.
- *
- * If BDRV_BLOCK_OFFSET_VALID is set, the map parameter represents the
- * host offset within the returned BDS that is allocated for the
- * corresponding raw guest data.  However, whether that offset
- * actually contains data also depends on BDRV_BLOCK_DATA, as follows:
- *
- * DATA ZERO OFFSET_VALID
- *  t    t        t       sectors read as zero, returned file is zero at offset
- *  t    f        t       sectors read as valid from file at offset
- *  f    t        t       sectors preallocated, read as zero, returned file not
- *                        necessarily zero at offset
- *  f    f        t       sectors preallocated but read from backing_hd,
- *                        returned file contains garbage at offset
- *  t    t        f       sectors preallocated, read as zero, unknown offset
- *  t    f        f       sectors read from unknown file or offset
- *  f    t        f       not allocated or unknown offset, read as zero
- *  f    f        f       not allocated or unknown offset, read from backing_hd
- */
-#define BDRV_BLOCK_DATA         0x01
-#define BDRV_BLOCK_ZERO         0x02
-#define BDRV_BLOCK_OFFSET_VALID 0x04
-#define BDRV_BLOCK_RAW          0x08
-#define BDRV_BLOCK_ALLOCATED    0x10
-#define BDRV_BLOCK_EOF          0x20
-#define BDRV_BLOCK_RECURSE      0x40
-
-typedef QTAILQ_HEAD(BlockReopenQueue, BlockReopenQueueEntry) BlockReopenQueue;
-
-typedef struct BDRVReopenState {
-    BlockDriverState *bs;
-    int flags;
-    BlockdevDetectZeroesOptions detect_zeroes;
-    bool backing_missing;
-    BlockDriverState *old_backing_bs; /* keep pointer for permissions update */
-    BlockDriverState *old_file_bs; /* keep pointer for permissions update */
-    QDict *options;
-    QDict *explicit_options;
-    void *opaque;
-} BDRVReopenState;
-
-/*
- * Block operation types
- */
-typedef enum BlockOpType {
-    BLOCK_OP_TYPE_BACKUP_SOURCE,
-    BLOCK_OP_TYPE_BACKUP_TARGET,
-    BLOCK_OP_TYPE_CHANGE,
-    BLOCK_OP_TYPE_COMMIT_SOURCE,
-    BLOCK_OP_TYPE_COMMIT_TARGET,
-    BLOCK_OP_TYPE_DATAPLANE,
-    BLOCK_OP_TYPE_DRIVE_DEL,
-    BLOCK_OP_TYPE_EJECT,
-    BLOCK_OP_TYPE_EXTERNAL_SNAPSHOT,
-    BLOCK_OP_TYPE_INTERNAL_SNAPSHOT,
-    BLOCK_OP_TYPE_INTERNAL_SNAPSHOT_DELETE,
-    BLOCK_OP_TYPE_MIRROR_SOURCE,
-    BLOCK_OP_TYPE_MIRROR_TARGET,
-    BLOCK_OP_TYPE_RESIZE,
-    BLOCK_OP_TYPE_STREAM,
-    BLOCK_OP_TYPE_REPLACE,
-    BLOCK_OP_TYPE_MAX,
-} BlockOpType;
-
-/* Block node permission constants */
-enum {
-    /**
-     * A user that has the "permission" of consistent reads is guaranteed that
-     * their view of the contents of the block device is complete and
-     * self-consistent, representing the contents of a disk at a specific
-     * point.
-     *
-     * For most block devices (including their backing files) this is true, but
-     * the property cannot be maintained in a few situations like for
-     * intermediate nodes of a commit block job.
-     */
-    BLK_PERM_CONSISTENT_READ    = 0x01,
-
-    /** This permission is required to change the visible disk contents. */
-    BLK_PERM_WRITE              = 0x02,
-
-    /**
-     * This permission (which is weaker than BLK_PERM_WRITE) is both enough and
-     * required for writes to the block node when the caller promises that
-     * the visible disk content doesn't change.
-     *
-     * As the BLK_PERM_WRITE permission is strictly stronger, either is
-     * sufficient to perform an unchanging write.
-     */
-    BLK_PERM_WRITE_UNCHANGED    = 0x04,
-
-    /** This permission is required to change the size of a block node. */
-    BLK_PERM_RESIZE             = 0x08,
-
-    /**
-     * This permission is required to change the node that this BdrvChild
-     * points to.
-     */
-    BLK_PERM_GRAPH_MOD          = 0x10,
-
-    BLK_PERM_ALL                = 0x1f,
-
-    DEFAULT_PERM_PASSTHROUGH    = BLK_PERM_CONSISTENT_READ
-                                 | BLK_PERM_WRITE
-                                 | BLK_PERM_WRITE_UNCHANGED
-                                 | BLK_PERM_RESIZE,
-
-    DEFAULT_PERM_UNCHANGED      = BLK_PERM_ALL & ~DEFAULT_PERM_PASSTHROUGH,
-};
-
-/*
- * Flags that parent nodes assign to child nodes to specify what kind of
- * role(s) they take.
- *
- * At least one of DATA, METADATA, FILTERED, or COW must be set for
- * every child.
- */
-enum BdrvChildRoleBits {
-    /*
-     * This child stores data.
-     * Any node may have an arbitrary number of such children.
-     */
-    BDRV_CHILD_DATA         = (1 << 0),
-
-    /*
-     * This child stores metadata.
-     * Any node may have an arbitrary number of metadata-storing
-     * children.
-     */
-    BDRV_CHILD_METADATA     = (1 << 1),
-
-    /*
-     * A child that always presents exactly the same visible data as
-     * the parent, e.g. by virtue of the parent forwarding all reads
-     * and writes.
-     * This flag is mutually exclusive with DATA, METADATA, and COW.
-     * Any node may have at most one filtered child at a time.
-     */
-    BDRV_CHILD_FILTERED     = (1 << 2),
-
-    /*
-     * Child from which to read all data that isn't allocated in the
-     * parent (i.e., the backing child); such data is copied to the
-     * parent through COW (and optionally COR).
-     * This field is mutually exclusive with DATA, METADATA, and
-     * FILTERED.
-     * Any node may have at most one such backing child at a time.
-     */
-    BDRV_CHILD_COW          = (1 << 3),
-
-    /*
-     * The primary child.  For most drivers, this is the child whose
-     * filename applies best to the parent node.
-     * Any node may have at most one primary child at a time.
-     */
-    BDRV_CHILD_PRIMARY      = (1 << 4),
-
-    /* Useful combination of flags */
-    BDRV_CHILD_IMAGE        = BDRV_CHILD_DATA
-                              | BDRV_CHILD_METADATA
-                              | BDRV_CHILD_PRIMARY,
-};
-
-/* Mask of BdrvChildRoleBits values */
-typedef unsigned int BdrvChildRole;
-
-char *bdrv_perm_names(uint64_t perm);
-uint64_t bdrv_qapi_perm_to_blk_perm(BlockPermission qapi_perm);
-
-/* disk I/O throttling */
-void bdrv_init(void);
-void bdrv_init_with_whitelist(void);
-bool bdrv_uses_whitelist(void);
-int bdrv_is_whitelisted(BlockDriver *drv, bool read_only);
-BlockDriver *bdrv_find_protocol(const char *filename,
-                                bool allow_protocol_prefix,
-                                Error **errp);
-BlockDriver *bdrv_find_format(const char *format_name);
-int bdrv_create(BlockDriver *drv, const char* filename,
-                QemuOpts *opts, Error **errp);
-int bdrv_create_file(const char *filename, QemuOpts *opts, Error **errp);
-
-BlockDriverState *bdrv_new(void);
-int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
-                Error **errp);
-int bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,
-                      Error **errp);
-int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
-                          Error **errp);
-BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *node_options,
-                                   int flags, Error **errp);
-int bdrv_drop_filter(BlockDriverState *bs, Error **errp);
-
-int bdrv_parse_aio(const char *mode, int *flags);
-int bdrv_parse_cache_mode(const char *mode, int *flags, bool *writethrough);
-int bdrv_parse_discard_flags(const char *mode, int *flags);
-BdrvChild *bdrv_open_child(const char *filename,
-                           QDict *options, const char *bdref_key,
-                           BlockDriverState* parent,
-                           const BdrvChildClass *child_class,
-                           BdrvChildRole child_role,
-                           bool allow_none, Error **errp);
-BlockDriverState *bdrv_open_blockdev_ref(BlockdevRef *ref, Error **errp);
-int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
-                        Error **errp);
-int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options,
-                           const char *bdref_key, Error **errp);
-BlockDriverState *bdrv_open(const char *filename, const char *reference,
-                            QDict *options, int flags, Error **errp);
-BlockDriverState *bdrv_new_open_driver_opts(BlockDriver *drv,
-                                            const char *node_name,
-                                            QDict *options, int flags,
-                                            Error **errp);
-BlockDriverState *bdrv_new_open_driver(BlockDriver *drv, const char *node_name,
-                                       int flags, Error **errp);
-BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
-                                    BlockDriverState *bs, QDict *options,
-                                    bool keep_old_opts);
-void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue);
-int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp);
-int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
-                Error **errp);
-int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
-                              Error **errp);
-int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
-                       int64_t bytes, BdrvRequestFlags flags);
-int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags);
-int bdrv_pread(BdrvChild *child, int64_t offset, void *buf, int64_t bytes);
-int bdrv_pwrite(BdrvChild *child, int64_t offset, const void *buf,
-                int64_t bytes);
-int bdrv_pwrite_sync(BdrvChild *child, int64_t offset,
-                     const void *buf, int64_t bytes);
-/*
- * Efficiently zero a region of the disk image.  Note that this is a regular
- * I/O request like read or write and should have a reasonable size.  This
- * function is not suitable for zeroing the entire image in a single request
- * because it may allocate memory for the entire region.
- */
-int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
-                                       int64_t bytes, BdrvRequestFlags flags);
-BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
-    const char *backing_file);
-void bdrv_refresh_filename(BlockDriverState *bs);
-
-int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
-                                  PreallocMode prealloc, BdrvRequestFlags flags,
-                                  Error **errp);
-int generated_co_wrapper
-bdrv_truncate(BdrvChild *child, int64_t offset, bool exact,
-              PreallocMode prealloc, BdrvRequestFlags flags, Error **errp);
-
-int64_t bdrv_nb_sectors(BlockDriverState *bs);
-int64_t bdrv_getlength(BlockDriverState *bs);
-int64_t bdrv_get_allocated_file_size(BlockDriverState *bs);
-BlockMeasureInfo *bdrv_measure(BlockDriver *drv, QemuOpts *opts,
-                               BlockDriverState *in_bs, Error **errp);
-void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr);
-void bdrv_refresh_limits(BlockDriverState *bs, Transaction *tran, Error **errp);
-int bdrv_commit(BlockDriverState *bs);
-int bdrv_make_empty(BdrvChild *c, Error **errp);
-int bdrv_change_backing_file(BlockDriverState *bs, const char *backing_file,
-                             const char *backing_fmt, bool warn);
-void bdrv_register(BlockDriver *bdrv);
-int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
-                           const char *backing_file_str);
-BlockDriverState *bdrv_find_overlay(BlockDriverState *active,
-                                    BlockDriverState *bs);
-BlockDriverState *bdrv_find_base(BlockDriverState *bs);
-bool bdrv_is_backing_chain_frozen(BlockDriverState *bs, BlockDriverState *base,
-                                  Error **errp);
-int bdrv_freeze_backing_chain(BlockDriverState *bs, BlockDriverState *base,
-                              Error **errp);
-void bdrv_unfreeze_backing_chain(BlockDriverState *bs, BlockDriverState *base);
-int coroutine_fn bdrv_co_delete_file(BlockDriverState *bs, Error **errp);
-void coroutine_fn bdrv_co_delete_file_noerr(BlockDriverState *bs);
-
-
-typedef struct BdrvCheckResult {
-    int corruptions;
-    int leaks;
-    int check_errors;
-    int corruptions_fixed;
-    int leaks_fixed;
-    int64_t image_end_offset;
-    BlockFragInfo bfi;
-} BdrvCheckResult;
-
-typedef enum {
-    BDRV_FIX_LEAKS    = 1,
-    BDRV_FIX_ERRORS   = 2,
-} BdrvCheckMode;
-
-int generated_co_wrapper bdrv_check(BlockDriverState *bs, BdrvCheckResult *res,
-                                    BdrvCheckMode fix);
-
-/* The units of offset and total_work_size may be chosen arbitrarily by the
- * block driver; total_work_size may change during the course of the amendment
- * operation */
-typedef void BlockDriverAmendStatusCB(BlockDriverState *bs, int64_t offset,
-                                      int64_t total_work_size, void *opaque);
-int bdrv_amend_options(BlockDriverState *bs_new, QemuOpts *opts,
-                       BlockDriverAmendStatusCB *status_cb, void *cb_opaque,
-                       bool force,
-                       Error **errp);
-
-/* check if a named node can be replaced when doing drive-mirror */
-BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
-                                        const char *node_name, Error **errp);
-
-/* async block I/O */
-void bdrv_aio_cancel(BlockAIOCB *acb);
-void bdrv_aio_cancel_async(BlockAIOCB *acb);
-
-/* sg packet commands */
-int bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf);
-
-/* Invalidate any cached metadata used by image formats */
-int generated_co_wrapper bdrv_invalidate_cache(BlockDriverState *bs,
-                                               Error **errp);
-void bdrv_invalidate_cache_all(Error **errp);
-int bdrv_inactivate_all(void);
-
-/* Ensure contents are flushed to disk.  */
-int generated_co_wrapper bdrv_flush(BlockDriverState *bs);
-int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
-int bdrv_flush_all(void);
-void bdrv_close_all(void);
-void bdrv_drain(BlockDriverState *bs);
-void coroutine_fn bdrv_co_drain(BlockDriverState *bs);
-void bdrv_drain_all_begin(void);
-void bdrv_drain_all_end(void);
-void bdrv_drain_all(void);
-
-#define BDRV_POLL_WHILE(bs, cond) ({                       \
-    BlockDriverState *bs_ = (bs);                          \
-    AIO_WAIT_WHILE(bdrv_get_aio_context(bs_),              \
-                   cond); })
-
-int generated_co_wrapper bdrv_pdiscard(BdrvChild *child, int64_t offset,
-                                       int64_t bytes);
-int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
-int bdrv_has_zero_init_1(BlockDriverState *bs);
-int bdrv_has_zero_init(BlockDriverState *bs);
-bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
-int bdrv_block_status(BlockDriverState *bs, int64_t offset,
-                      int64_t bytes, int64_t *pnum, int64_t *map,
-                      BlockDriverState **file);
-int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
-                            int64_t offset, int64_t bytes, int64_t *pnum,
-                            int64_t *map, BlockDriverState **file);
-int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
-                      int64_t *pnum);
-int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
-                            bool include_base, int64_t offset, int64_t bytes,
-                            int64_t *pnum);
-int coroutine_fn bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset,
-                                      int64_t bytes);
-
-bool bdrv_is_read_only(BlockDriverState *bs);
-int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only,
-                           bool ignore_allow_rdw, Error **errp);
-int bdrv_apply_auto_read_only(BlockDriverState *bs, const char *errmsg,
-                              Error **errp);
-bool bdrv_is_writable(BlockDriverState *bs);
-bool bdrv_is_sg(BlockDriverState *bs);
-bool bdrv_is_inserted(BlockDriverState *bs);
-void bdrv_lock_medium(BlockDriverState *bs, bool locked);
-void bdrv_eject(BlockDriverState *bs, bool eject_flag);
-const char *bdrv_get_format_name(BlockDriverState *bs);
-BlockDriverState *bdrv_find_node(const char *node_name);
-BlockDeviceInfoList *bdrv_named_nodes_list(bool flat, Error **errp);
-XDbgBlockGraph *bdrv_get_xdbg_block_graph(Error **errp);
-BlockDriverState *bdrv_lookup_bs(const char *device,
-                                 const char *node_name,
-                                 Error **errp);
-bool bdrv_chain_contains(BlockDriverState *top, BlockDriverState *base);
-BlockDriverState *bdrv_next_node(BlockDriverState *bs);
-BlockDriverState *bdrv_next_all_states(BlockDriverState *bs);
-
-typedef struct BdrvNextIterator {
-    enum {
-        BDRV_NEXT_BACKEND_ROOTS,
-        BDRV_NEXT_MONITOR_OWNED,
-    } phase;
-    BlockBackend *blk;
-    BlockDriverState *bs;
-} BdrvNextIterator;
-
-BlockDriverState *bdrv_first(BdrvNextIterator *it);
-BlockDriverState *bdrv_next(BdrvNextIterator *it);
-void bdrv_next_cleanup(BdrvNextIterator *it);
-
-BlockDriverState *bdrv_next_monitor_owned(BlockDriverState *bs);
-bool bdrv_supports_compressed_writes(BlockDriverState *bs);
-void bdrv_iterate_format(void (*it)(void *opaque, const char *name),
-                         void *opaque, bool read_only);
-const char *bdrv_get_node_name(const BlockDriverState *bs);
-const char *bdrv_get_device_name(const BlockDriverState *bs);
-const char *bdrv_get_device_or_node_name(const BlockDriverState *bs);
-int bdrv_get_flags(BlockDriverState *bs);
-int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);
-ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs,
-                                          Error **errp);
-BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs);
-void bdrv_round_to_clusters(BlockDriverState *bs,
-                            int64_t offset, int64_t bytes,
-                            int64_t *cluster_offset,
-                            int64_t *cluster_bytes);
-
-void bdrv_get_backing_filename(BlockDriverState *bs,
-                               char *filename, int filename_size);
-char *bdrv_get_full_backing_filename(BlockDriverState *bs, Error **errp);
-char *bdrv_get_full_backing_filename_from_filename(const char *backed,
-                                                   const char *backing,
-                                                   Error **errp);
-char *bdrv_dirname(BlockDriverState *bs, Error **errp);
-
-int path_has_protocol(const char *path);
-int path_is_absolute(const char *path);
-char *path_combine(const char *base_path, const char *filename);
-
-int generated_co_wrapper
-bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
-int generated_co_wrapper
-bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
-int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
-                      int64_t pos, int size);
-
-int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
-                      int64_t pos, int size);
-
-void bdrv_img_create(const char *filename, const char *fmt,
-                     const char *base_filename, const char *base_fmt,
-                     char *options, uint64_t img_size, int flags,
-                     bool quiet, Error **errp);
-
-/* Returns the alignment in bytes that is required so that no bounce buffer
- * is required throughout the stack */
-size_t bdrv_min_mem_align(BlockDriverState *bs);
-/* Returns optimal alignment in bytes for bounce buffer */
-size_t bdrv_opt_mem_align(BlockDriverState *bs);
-void *qemu_blockalign(BlockDriverState *bs, size_t size);
-void *qemu_blockalign0(BlockDriverState *bs, size_t size);
-void *qemu_try_blockalign(BlockDriverState *bs, size_t size);
-void *qemu_try_blockalign0(BlockDriverState *bs, size_t size);
-bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov);
-
-void bdrv_enable_copy_on_read(BlockDriverState *bs);
-void bdrv_disable_copy_on_read(BlockDriverState *bs);
-
-void bdrv_ref(BlockDriverState *bs);
-void bdrv_unref(BlockDriverState *bs);
-void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
-BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
-                             BlockDriverState *child_bs,
-                             const char *child_name,
-                             const BdrvChildClass *child_class,
-                             BdrvChildRole child_role,
-                             Error **errp);
-
-bool bdrv_op_is_blocked(BlockDriverState *bs, BlockOpType op, Error **errp);
-void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason);
-void bdrv_op_unblock(BlockDriverState *bs, BlockOpType op, Error *reason);
-void bdrv_op_block_all(BlockDriverState *bs, Error *reason);
-void bdrv_op_unblock_all(BlockDriverState *bs, Error *reason);
-bool bdrv_op_blocker_is_empty(BlockDriverState *bs);
-
-#define BLKDBG_EVENT(child, evt) \
-    do { \
-        if (child) { \
-            bdrv_debug_event(child->bs, evt); \
-        } \
-    } while (0)
-
-void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event);
-
-int bdrv_debug_breakpoint(BlockDriverState *bs, const char *event,
-                           const char *tag);
-int bdrv_debug_remove_breakpoint(BlockDriverState *bs, const char *tag);
-int bdrv_debug_resume(BlockDriverState *bs, const char *tag);
-bool bdrv_debug_is_suspended(BlockDriverState *bs, const char *tag);
-
-/**
- * bdrv_get_aio_context:
+ * QEMU System Emulator block driver
  *
- * Returns: the currently bound #AioContext
- */
-AioContext *bdrv_get_aio_context(BlockDriverState *bs);
-
-/**
- * Move the current coroutine to the AioContext of @bs and return the old
- * AioContext of the coroutine. Increase bs->in_flight so that draining @bs
- * will wait for the operation to proceed until the corresponding
- * bdrv_co_leave().
+ * Copyright (c) 2003 Fabrice Bellard
  *
- * Consequently, you can't call drain inside a bdrv_co_enter/leave() section as
- * this will deadlock.
- */
-AioContext *coroutine_fn bdrv_co_enter(BlockDriverState *bs);
-
-/**
- * Ends a section started by bdrv_co_enter(). Move the current coroutine back
- * to old_ctx and decrease bs->in_flight again.
- */
-void coroutine_fn bdrv_co_leave(BlockDriverState *bs, AioContext *old_ctx);
-
-/**
- * Locks the AioContext of @bs if it's not the current AioContext. This avoids
- * double locking which could lead to deadlocks: This is a coroutine_fn, so we
- * know we already own the lock of the current AioContext.
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
  *
- * May only be called in the main thread.
- */
-void coroutine_fn bdrv_co_lock(BlockDriverState *bs);
-
-/**
- * Unlocks the AioContext of @bs if it's not the current AioContext.
- */
-void coroutine_fn bdrv_co_unlock(BlockDriverState *bs);
-
-/**
- * Transfer control to @co in the aio context of @bs
- */
-void bdrv_coroutine_enter(BlockDriverState *bs, Coroutine *co);
-
-void bdrv_set_aio_context_ignore(BlockDriverState *bs,
-                                 AioContext *new_context, GSList **ignore);
-int bdrv_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
-                             Error **errp);
-int bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
-                                   BdrvChild *ignore_child, Error **errp);
-bool bdrv_child_can_set_aio_context(BdrvChild *c, AioContext *ctx,
-                                    GSList **ignore, Error **errp);
-bool bdrv_can_set_aio_context(BlockDriverState *bs, AioContext *ctx,
-                              GSList **ignore, Error **errp);
-AioContext *bdrv_child_get_parent_aio_context(BdrvChild *c);
-AioContext *child_of_bds_get_parent_aio_context(BdrvChild *c);
-
-int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz);
-int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry *geo);
-
-void bdrv_io_plug(BlockDriverState *bs);
-void bdrv_io_unplug(BlockDriverState *bs);
-
-/**
- * bdrv_parent_drained_begin_single:
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
  *
- * Begin a quiesced section for the parent of @c. If @poll is true, wait for
- * any pending activity to cease.
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
  */
-void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll);
-
-/**
- * bdrv_parent_drained_end_single:
- *
- * End a quiesced section for the parent of @c.
- *
- * This polls @bs's AioContext until all scheduled sub-drained_ends
- * have settled, which may result in graph changes.
- */
-void bdrv_parent_drained_end_single(BdrvChild *c);
-
-/**
- * bdrv_drain_poll:
- *
- * Poll for pending requests in @bs, its parents (except for @ignore_parent),
- * and if @recursive is true its children as well (used for subtree drain).
- *
- * If @ignore_bds_parents is true, parents that are BlockDriverStates must
- * ignore the drain request because they will be drained separately (used for
- * drain_all).
- *
- * This is part of bdrv_drained_begin.
- */
-bool bdrv_drain_poll(BlockDriverState *bs, bool recursive,
-                     BdrvChild *ignore_parent, bool ignore_bds_parents);
-
-/**
- * bdrv_drained_begin:
- *
- * Begin a quiesced section for exclusive access to the BDS, by disabling
- * external request sources including NBD server, block jobs, and device model.
- *
- * This function can be recursive.
- */
-void bdrv_drained_begin(BlockDriverState *bs);
-
-/**
- * bdrv_do_drained_begin_quiesce:
- *
- * Quiesces a BDS like bdrv_drained_begin(), but does not wait for already
- * running requests to complete.
- */
-void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
-                                   BdrvChild *parent, bool ignore_bds_parents);
-
-/**
- * Like bdrv_drained_begin, but recursively begins a quiesced section for
- * exclusive access to all child nodes as well.
- */
-void bdrv_subtree_drained_begin(BlockDriverState *bs);
-
-/**
- * bdrv_drained_end:
- *
- * End a quiescent section started by bdrv_drained_begin().
- *
- * This polls @bs's AioContext until all scheduled sub-drained_ends
- * have settled.  On one hand, that may result in graph changes.  On
- * the other, this requires that the caller either runs in the main
- * loop; or that all involved nodes (@bs and all of its parents) are
- * in the caller's AioContext.
- */
-void bdrv_drained_end(BlockDriverState *bs);
-
-/**
- * bdrv_drained_end_no_poll:
- *
- * Same as bdrv_drained_end(), but do not poll for the subgraph to
- * actually become unquiesced.  Therefore, no graph changes will occur
- * with this function.
- *
- * *drained_end_counter is incremented for every background operation
- * that is scheduled, and will be decremented for every operation once
- * it settles.  The caller must poll until it reaches 0.  The counter
- * should be accessed using atomic operations only.
- */
-void bdrv_drained_end_no_poll(BlockDriverState *bs, int *drained_end_counter);
-
-/**
- * End a quiescent section started by bdrv_subtree_drained_begin().
- */
-void bdrv_subtree_drained_end(BlockDriverState *bs);
-
-void bdrv_add_child(BlockDriverState *parent, BlockDriverState *child,
-                    Error **errp);
-void bdrv_del_child(BlockDriverState *parent, BdrvChild *child, Error **errp);
-
-bool bdrv_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
-                                     uint32_t granularity, Error **errp);
-/**
- *
- * bdrv_register_buf/bdrv_unregister_buf:
- *
- * Register/unregister a buffer for I/O. For example, VFIO drivers are
- * interested to know the memory areas that would later be used for I/O, so
- * that they can prepare IOMMU mapping etc., to get better performance.
- */
-void bdrv_register_buf(BlockDriverState *bs, void *host, size_t size);
-void bdrv_unregister_buf(BlockDriverState *bs, void *host);
+#ifndef BLOCK_H
+#define BLOCK_H
 
-/**
- *
- * bdrv_co_copy_range:
- *
- * Do offloaded copy between two children. If the operation is not implemented
- * by the driver, or if the backend storage doesn't support it, a negative
- * error code will be returned.
- *
- * Note: block layer doesn't emulate or fallback to a bounce buffer approach
- * because usually the caller shouldn't attempt offloaded copy any more (e.g.
- * calling copy_file_range(2)) after the first error, thus it should fall back
- * to a read+write path in the caller level.
- *
- * @src: Source child to copy data from
- * @src_offset: offset in @src image to read data
- * @dst: Destination child to copy data to
- * @dst_offset: offset in @dst image to write data
- * @bytes: number of bytes to copy
- * @flags: request flags. Supported flags:
- *         BDRV_REQ_ZERO_WRITE - treat the @src range as zero data and do zero
- *                               write on @dst as if bdrv_co_pwrite_zeroes is
- *                               called. Used to simplify caller code, or
- *                               during BlockDriver.bdrv_co_copy_range_from()
- *                               recursion.
- *         BDRV_REQ_NO_SERIALISING - do not serialize with other overlapping
- *                                   requests currently in flight.
- *
- * Returns: 0 if succeeded; negative error code if failed.
- **/
-int coroutine_fn bdrv_co_copy_range(BdrvChild *src, int64_t src_offset,
-                                    BdrvChild *dst, int64_t dst_offset,
-                                    int64_t bytes, BdrvRequestFlags read_flags,
-                                    BdrvRequestFlags write_flags);
+#include "block-global-state.h"
+#include "block-io.h"
 
-void bdrv_cancel_in_flight(BlockDriverState *bs);
+/* DO NOT ADD ANYTHING IN HERE. USE ONE OF THE HEADERS INCLUDED ABOVE */
 
-#endif
+#endif /* BLOCK_H */
-- 
2.27.0




* [PATCH v4 03/25] assertions for block global state API
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
  2021-10-25 10:17 ` [PATCH v4 01/25] main-loop.h: introduce qemu_in_main_thread() Emanuele Giuseppe Esposito
  2021-10-25 10:17 ` [PATCH v4 02/25] include/block/block: split header into I/O and global state API Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-11 16:32   ` Hanna Reitz
  2021-11-12 11:31   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 04/25] include/sysemu/block-backend: split header into I/O and global state (GS) API Emanuele Giuseppe Esposito
                   ` (24 subsequent siblings)
  27 siblings, 2 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Every global state (GS) API function now asserts that
qemu_in_main_thread() returns true. If the assertion fails,
BQL protection cannot be guaranteed for that caller, and the
function needs to be moved to the I/O API instead.
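
As a rough illustration of the pattern this patch applies, here is a
minimal standalone sketch. It does not use QEMU code: the sketch_*
names are hypothetical, and sketch_in_main_thread() is only a flag-based
stand-in for qemu_in_main_thread(), whose real implementation comes from
patch 01 and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: a single flag says whether the caller is in the
 * main thread with the BQL held. This is an assumption made for the
 * sketch, not how QEMU tracks it.
 */
static bool sketch_main_thread_with_bql;

static void sketch_enter_main_loop(void)
{
    sketch_main_thread_with_bql = true;
}

static void sketch_leave_main_loop(void)
{
    sketch_main_thread_with_bql = false;
}

/* Stand-in for qemu_in_main_thread(). */
static bool sketch_in_main_thread(void)
{
    return sketch_main_thread_with_bql;
}

/* State that only global state (GS) functions may touch. */
static int sketch_refcnt;

/*
 * GS-style function: the assertion is the first statement after the
 * declarations, so a call from the wrong context aborts before any
 * state is modified.
 */
static void sketch_gs_ref(void)
{
    assert(sketch_in_main_thread());
    sketch_refcnt++;
}

/* I/O-style function: callable from any context, so no assertion. */
static int sketch_io_read_refcnt(void)
{
    return sketch_refcnt;
}
```

In this model, calling sketch_gs_ref() after sketch_leave_main_loop()
would abort, which is the signal that such a function either has a bad
caller or belongs in the I/O API.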

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block.c        | 136 +++++++++++++++++++++++++++++++++++++++++++++++--
 block/commit.c |   2 +
 block/io.c     |  20 ++++++++
 blockdev.c     |   1 +
 4 files changed, 156 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index 6fdb4d7712..672f946065 100644
--- a/block.c
+++ b/block.c
@@ -386,6 +386,7 @@ char *bdrv_get_full_backing_filename(BlockDriverState *bs, Error **errp)
 void bdrv_register(BlockDriver *bdrv)
 {
     assert(bdrv->format_name);
+    assert(qemu_in_main_thread());
     QLIST_INSERT_HEAD(&bdrv_drivers, bdrv, list);
 }
 
@@ -394,6 +395,8 @@ BlockDriverState *bdrv_new(void)
     BlockDriverState *bs;
     int i;
 
+    assert(qemu_in_main_thread());
+
     bs = g_new0(BlockDriverState, 1);
     QLIST_INIT(&bs->dirty_bitmaps);
     for (i = 0; i < BLOCK_OP_TYPE_MAX; i++) {
@@ -421,6 +424,7 @@ BlockDriverState *bdrv_new(void)
 static BlockDriver *bdrv_do_find_format(const char *format_name)
 {
     BlockDriver *drv1;
+    assert(qemu_in_main_thread());
 
     QLIST_FOREACH(drv1, &bdrv_drivers, list) {
         if (!strcmp(drv1->format_name, format_name)) {
@@ -436,6 +440,8 @@ BlockDriver *bdrv_find_format(const char *format_name)
     BlockDriver *drv1;
     int i;
 
+    assert(qemu_in_main_thread());
+
     drv1 = bdrv_do_find_format(format_name);
     if (drv1) {
         return drv1;
@@ -485,11 +491,13 @@ static int bdrv_format_is_whitelisted(const char *format_name, bool read_only)
 
 int bdrv_is_whitelisted(BlockDriver *drv, bool read_only)
 {
+    assert(qemu_in_main_thread());
     return bdrv_format_is_whitelisted(drv->format_name, read_only);
 }
 
 bool bdrv_uses_whitelist(void)
 {
+    assert(qemu_in_main_thread());
     return use_bdrv_whitelist;
 }
 
@@ -520,6 +528,8 @@ int bdrv_create(BlockDriver *drv, const char* filename,
 {
     int ret;
 
+    assert(qemu_in_main_thread());
+
     Coroutine *co;
     CreateCo cco = {
         .drv = drv,
@@ -695,6 +705,8 @@ int bdrv_create_file(const char *filename, QemuOpts *opts, Error **errp)
     QDict *qdict;
     int ret;
 
+    assert(qemu_in_main_thread());
+
     drv = bdrv_find_protocol(filename, true, errp);
     if (drv == NULL) {
         return -ENOENT;
@@ -792,6 +804,7 @@ int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
 {
     BlockDriver *drv = bs->drv;
     BlockDriverState *filtered = bdrv_filter_bs(bs);
+    assert(qemu_in_main_thread());
 
     if (drv && drv->bdrv_probe_blocksizes) {
         return drv->bdrv_probe_blocksizes(bs, bsz);
@@ -812,6 +825,7 @@ int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry *geo)
 {
     BlockDriver *drv = bs->drv;
     BlockDriverState *filtered = bdrv_filter_bs(bs);
+    assert(qemu_in_main_thread());
 
     if (drv && drv->bdrv_probe_geometry) {
         return drv->bdrv_probe_geometry(bs, geo);
@@ -866,6 +880,7 @@ static BlockDriver *find_hdev_driver(const char *filename)
 {
     int score_max = 0, score;
     BlockDriver *drv = NULL, *d;
+    assert(qemu_in_main_thread());
 
     QLIST_FOREACH(d, &bdrv_drivers, list) {
         if (d->bdrv_probe_device) {
@@ -883,6 +898,7 @@ static BlockDriver *find_hdev_driver(const char *filename)
 static BlockDriver *bdrv_do_find_protocol(const char *protocol)
 {
     BlockDriver *drv1;
+    assert(qemu_in_main_thread());
 
     QLIST_FOREACH(drv1, &bdrv_drivers, list) {
         if (drv1->protocol_name && !strcmp(drv1->protocol_name, protocol)) {
@@ -903,6 +919,7 @@ BlockDriver *bdrv_find_protocol(const char *filename,
     const char *p;
     int i;
 
+    assert(qemu_in_main_thread());
     /* TODO Drivers without bdrv_file_open must be specified explicitly */
 
     /*
@@ -968,6 +985,7 @@ BlockDriver *bdrv_probe_all(const uint8_t *buf, int buf_size,
 {
     int score_max = 0, score;
     BlockDriver *drv = NULL, *d;
+    assert(qemu_in_main_thread());
 
     QLIST_FOREACH(d, &bdrv_drivers, list) {
         if (d->bdrv_probe) {
@@ -1115,6 +1133,7 @@ int bdrv_parse_aio(const char *mode, int *flags)
  */
 int bdrv_parse_discard_flags(const char *mode, int *flags)
 {
+    assert(qemu_in_main_thread());
     *flags &= ~BDRV_O_UNMAP;
 
     if (!strcmp(mode, "off") || !strcmp(mode, "ignore")) {
@@ -1135,6 +1154,7 @@ int bdrv_parse_discard_flags(const char *mode, int *flags)
  */
 int bdrv_parse_cache_mode(const char *mode, int *flags, bool *writethrough)
 {
+    assert(qemu_in_main_thread());
     *flags &= ~BDRV_O_CACHE_MASK;
 
     if (!strcmp(mode, "off") || !strcmp(mode, "none")) {
@@ -1499,6 +1519,7 @@ static void bdrv_assign_node_name(BlockDriverState *bs,
                                   Error **errp)
 {
     char *gen_node_name = NULL;
+    assert(qemu_in_main_thread());
 
     if (!node_name) {
         node_name = gen_node_name = id_generate(ID_BLOCK);
@@ -1623,6 +1644,8 @@ BlockDriverState *bdrv_new_open_driver_opts(BlockDriver *drv,
     BlockDriverState *bs;
     int ret;
 
+    assert(qemu_in_main_thread());
+
     bs = bdrv_new();
     bs->open_flags = flags;
     bs->options = options ?: qdict_new();
@@ -1648,6 +1671,7 @@ BlockDriverState *bdrv_new_open_driver_opts(BlockDriver *drv,
 BlockDriverState *bdrv_new_open_driver(BlockDriver *drv, const char *node_name,
                                        int flags, Error **errp)
 {
+    assert(qemu_in_main_thread());
     return bdrv_new_open_driver_opts(drv, node_name, NULL, flags, errp);
 }
 
@@ -1742,6 +1766,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockBackend *file,
 
     assert(bs->file == NULL);
     assert(options != NULL && bs->options != options);
+    assert(qemu_in_main_thread());
 
     opts = qemu_opts_create(&bdrv_runtime_opts, NULL, 0, &error_abort);
     if (!qemu_opts_absorb_qdict(opts, options, errp)) {
@@ -2999,6 +3024,8 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
     BdrvChild *child = NULL;
     Transaction *tran = tran_new();
 
+    assert(qemu_in_main_thread());
+
     ret = bdrv_attach_child_noperm(parent_bs, child_bs, child_name, child_class,
                                    child_role, &child, tran, errp);
     if (ret < 0) {
@@ -3025,6 +3052,8 @@ void bdrv_root_unref_child(BdrvChild *child)
 {
     BlockDriverState *child_bs;
 
+    assert(qemu_in_main_thread());
+
     child_bs = child->bs;
     bdrv_detach_child(child);
     bdrv_unref(child_bs);
@@ -3099,6 +3128,7 @@ static void bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child,
 /* Callers must ensure that child->frozen is false. */
 void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child)
 {
+    assert(qemu_in_main_thread());
     if (child == NULL) {
         return;
     }
@@ -3249,6 +3279,8 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
     int ret;
     Transaction *tran = tran_new();
 
+    assert(qemu_in_main_thread());
+
     ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
     if (ret < 0) {
         goto out;
@@ -3284,6 +3316,8 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options,
     QDict *tmp_parent_options = NULL;
     Error *local_err = NULL;
 
+    assert(qemu_in_main_thread());
+
     if (bs->backing != NULL) {
         goto free_exit;
     }
@@ -3443,6 +3477,8 @@ BdrvChild *bdrv_open_child(const char *filename,
 {
     BlockDriverState *bs;
 
+    assert(qemu_in_main_thread());
+
     bs = bdrv_open_child_bs(filename, options, bdref_key, parent, child_class,
                             child_role, allow_none, errp);
     if (bs == NULL) {
@@ -3465,6 +3501,8 @@ BlockDriverState *bdrv_open_blockdev_ref(BlockdevRef *ref, Error **errp)
     const char *reference = NULL;
     Visitor *v = NULL;
 
+    assert(qemu_in_main_thread());
+
     if (ref->type == QTYPE_QSTRING) {
         reference = ref->u.reference;
     } else {
@@ -3862,6 +3900,8 @@ close_and_fail:
 BlockDriverState *bdrv_open(const char *filename, const char *reference,
                             QDict *options, int flags, Error **errp)
 {
+    assert(qemu_in_main_thread());
+
     return bdrv_open_inherit(filename, reference, options, flags, NULL,
                              NULL, 0, errp);
 }
@@ -4116,12 +4156,15 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
                                     BlockDriverState *bs,
                                     QDict *options, bool keep_old_opts)
 {
+    assert(qemu_in_main_thread());
+
     return bdrv_reopen_queue_child(bs_queue, bs, options, NULL, 0, false,
                                    NULL, 0, keep_old_opts);
 }
 
 void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
 {
+    assert(qemu_in_main_thread());
     if (bs_queue) {
         BlockReopenQueueEntry *bs_entry, *next;
         QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
@@ -4269,6 +4312,8 @@ int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
     BlockReopenQueue *queue;
     int ret;
 
+    assert(qemu_in_main_thread());
+
     bdrv_subtree_drained_begin(bs);
     if (ctx != qemu_get_aio_context()) {
         aio_context_release(ctx);
@@ -4290,6 +4335,8 @@ int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
 {
     QDict *opts = qdict_new();
 
+    assert(qemu_in_main_thread());
+
     qdict_put_bool(opts, BDRV_OPT_READ_ONLY, read_only);
 
     return bdrv_reopen(bs, opts, true, errp);
@@ -4744,6 +4791,7 @@ static void bdrv_close(BlockDriverState *bs)
 void bdrv_close_all(void)
 {
     assert(job_next(NULL) == NULL);
+    assert(qemu_in_main_thread());
 
     /* Drop references from requests still in flight, such as canceled block
      * jobs whose AIO context has not been polled yet */
@@ -5025,11 +5073,15 @@ out:
 int bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,
                       Error **errp)
 {
+    assert(qemu_in_main_thread());
+
     return bdrv_replace_node_common(from, to, true, false, errp);
 }
 
 int bdrv_drop_filter(BlockDriverState *bs, Error **errp)
 {
+    assert(qemu_in_main_thread());
+
     return bdrv_replace_node_common(bs, bdrv_filter_or_cow_bs(bs), true, true,
                                     errp);
 }
@@ -5052,6 +5104,8 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
     int ret;
     Transaction *tran = tran_new();
 
+    assert(qemu_in_main_thread());
+
     assert(!bs_new->backing);
 
     ret = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
@@ -5110,6 +5164,7 @@ static void bdrv_delete(BlockDriverState *bs)
 {
     assert(bdrv_op_blocker_is_empty(bs));
     assert(!bs->refcnt);
+    assert(qemu_in_main_thread());
 
     /* remove from list, if necessary */
     if (bs->node_name[0] != '\0') {
@@ -5154,6 +5209,8 @@ BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *options,
 
     node_name = qdict_get_try_str(options, "node-name");
 
+    assert(qemu_in_main_thread());
+
     new_node_bs = bdrv_new_open_driver_opts(drv, node_name, options, flags,
                                             errp);
     options = NULL; /* bdrv_new_open_driver() eats options */
@@ -5214,6 +5271,8 @@ int bdrv_change_backing_file(BlockDriverState *bs, const char *backing_file,
     BlockDriver *drv = bs->drv;
     int ret;
 
+    assert(qemu_in_main_thread());
+
     if (!drv) {
         return -ENOMEDIUM;
     }
@@ -5255,6 +5314,9 @@ int bdrv_change_backing_file(BlockDriverState *bs, const char *backing_file,
 BlockDriverState *bdrv_find_overlay(BlockDriverState *active,
                                     BlockDriverState *bs)
 {
+
+    assert(qemu_in_main_thread());
+
     bs = bdrv_skip_filters(bs);
     active = bdrv_skip_filters(active);
 
@@ -5272,6 +5334,8 @@ BlockDriverState *bdrv_find_overlay(BlockDriverState *active,
 /* Given a BDS, searches for the base layer. */
 BlockDriverState *bdrv_find_base(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
+
     return bdrv_find_overlay(bs, NULL);
 }
 
@@ -5286,6 +5350,8 @@ bool bdrv_is_backing_chain_frozen(BlockDriverState *bs, BlockDriverState *base,
     BlockDriverState *i;
     BdrvChild *child;
 
+    assert(qemu_in_main_thread());
+
     for (i = bs; i != base; i = child_bs(child)) {
         child = bdrv_filter_or_cow_child(i);
 
@@ -5312,6 +5378,8 @@ int bdrv_freeze_backing_chain(BlockDriverState *bs, BlockDriverState *base,
     BlockDriverState *i;
     BdrvChild *child;
 
+    assert(qemu_in_main_thread());
+
     if (bdrv_is_backing_chain_frozen(bs, base, errp)) {
         return -EPERM;
     }
@@ -5346,6 +5414,8 @@ void bdrv_unfreeze_backing_chain(BlockDriverState *bs, BlockDriverState *base)
     BlockDriverState *i;
     BdrvChild *child;
 
+    assert(qemu_in_main_thread());
+
     for (i = bs; i != base; i = child_bs(child)) {
         child = bdrv_filter_or_cow_child(i);
         if (child) {
@@ -5395,6 +5465,8 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
     g_autoptr(GSList) updated_children = NULL;
     GSList *p;
 
+    assert(qemu_in_main_thread());
+
     bdrv_ref(top);
     bdrv_subtree_drained_begin(top);
 
@@ -5606,7 +5678,6 @@ int64_t bdrv_getlength(BlockDriverState *bs)
 void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr)
 {
     int64_t nb_sectors = bdrv_nb_sectors(bs);
-
     *nb_sectors_ptr = nb_sectors < 0 ? 0 : nb_sectors;
 }
 
@@ -5656,6 +5727,8 @@ void bdrv_iterate_format(void (*it)(void *opaque, const char *name),
     int i;
     const char **formats = NULL;
 
+    assert(qemu_in_main_thread());
+
     QLIST_FOREACH(drv, &bdrv_drivers, list) {
         if (drv->format_name) {
             bool found = false;
@@ -5714,6 +5787,7 @@ BlockDriverState *bdrv_find_node(const char *node_name)
     BlockDriverState *bs;
 
     assert(node_name);
+    assert(qemu_in_main_thread());
 
     QTAILQ_FOREACH(bs, &graph_bdrv_states, node_list) {
         if (!strcmp(node_name, bs->node_name)) {
@@ -5730,6 +5804,8 @@ BlockDeviceInfoList *bdrv_named_nodes_list(bool flat,
     BlockDeviceInfoList *list;
     BlockDriverState *bs;
 
+    assert(qemu_in_main_thread());
+
     list = NULL;
     QTAILQ_FOREACH(bs, &graph_bdrv_states, node_list) {
         BlockDeviceInfo *info = bdrv_block_device_info(NULL, bs, flat, errp);
@@ -5835,6 +5911,8 @@ XDbgBlockGraph *bdrv_get_xdbg_block_graph(Error **errp)
     BdrvChild *child;
     XDbgBlockGraphConstructor *gr = xdbg_graph_new();
 
+    assert(qemu_in_main_thread());
+
     for (blk = blk_all_next(NULL); blk; blk = blk_all_next(blk)) {
         char *allocated_name = NULL;
         const char *name = blk_name(blk);
@@ -5878,6 +5956,8 @@ BlockDriverState *bdrv_lookup_bs(const char *device,
     BlockBackend *blk;
     BlockDriverState *bs;
 
+    assert(qemu_in_main_thread());
+
     if (device) {
         blk = blk_by_name(device);
 
@@ -5909,6 +5989,9 @@ BlockDriverState *bdrv_lookup_bs(const char *device,
  * return false.  If either argument is NULL, return false. */
 bool bdrv_chain_contains(BlockDriverState *top, BlockDriverState *base)
 {
+
+    assert(qemu_in_main_thread());
+
     while (top && top != base) {
         top = bdrv_filter_or_cow_bs(top);
     }
@@ -5918,6 +6001,7 @@ bool bdrv_chain_contains(BlockDriverState *top, BlockDriverState *base)
 
 BlockDriverState *bdrv_next_node(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     if (!bs) {
         return QTAILQ_FIRST(&graph_bdrv_states);
     }
@@ -5926,6 +6010,7 @@ BlockDriverState *bdrv_next_node(BlockDriverState *bs)
 
 BlockDriverState *bdrv_next_all_states(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     if (!bs) {
         return QTAILQ_FIRST(&all_bdrv_states);
     }
@@ -5958,6 +6043,7 @@ const char *bdrv_get_parent_name(const BlockDriverState *bs)
 /* TODO check what callers really want: bs->node_name or blk_name() */
 const char *bdrv_get_device_name(const BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     return bdrv_get_parent_name(bs) ?: "";
 }
 
@@ -5967,22 +6053,26 @@ const char *bdrv_get_device_name(const BlockDriverState *bs)
  * absent, then this returns an empty (non-null) string. */
 const char *bdrv_get_device_or_node_name(const BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     return bdrv_get_parent_name(bs) ?: bs->node_name;
 }
 
 int bdrv_get_flags(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     return bs->open_flags;
 }
 
 int bdrv_has_zero_init_1(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     return 1;
 }
 
 int bdrv_has_zero_init(BlockDriverState *bs)
 {
     BlockDriverState *filtered;
+    assert(qemu_in_main_thread());
 
     if (!bs->drv) {
         return 0;
@@ -6079,6 +6169,7 @@ void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
 
 static BlockDriverState *bdrv_find_debug_node(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     while (bs && bs->drv && !bs->drv->bdrv_debug_breakpoint) {
         bs = bdrv_primary_bs(bs);
     }
@@ -6094,6 +6185,7 @@ static BlockDriverState *bdrv_find_debug_node(BlockDriverState *bs)
 int bdrv_debug_breakpoint(BlockDriverState *bs, const char *event,
                           const char *tag)
 {
+    assert(qemu_in_main_thread());
     bs = bdrv_find_debug_node(bs);
     if (bs) {
         return bs->drv->bdrv_debug_breakpoint(bs, event, tag);
@@ -6104,6 +6196,7 @@ int bdrv_debug_breakpoint(BlockDriverState *bs, const char *event,
 
 int bdrv_debug_remove_breakpoint(BlockDriverState *bs, const char *tag)
 {
+    assert(qemu_in_main_thread());
     bs = bdrv_find_debug_node(bs);
     if (bs) {
         return bs->drv->bdrv_debug_remove_breakpoint(bs, tag);
@@ -6114,6 +6207,7 @@ int bdrv_debug_remove_breakpoint(BlockDriverState *bs, const char *tag)
 
 int bdrv_debug_resume(BlockDriverState *bs, const char *tag)
 {
+    assert(qemu_in_main_thread());
     while (bs && (!bs->drv || !bs->drv->bdrv_debug_resume)) {
         bs = bdrv_primary_bs(bs);
     }
@@ -6127,6 +6221,7 @@ int bdrv_debug_resume(BlockDriverState *bs, const char *tag)
 
 bool bdrv_debug_is_suspended(BlockDriverState *bs, const char *tag)
 {
+    assert(qemu_in_main_thread());
     while (bs && bs->drv && !bs->drv->bdrv_debug_is_suspended) {
         bs = bdrv_primary_bs(bs);
     }
@@ -6154,6 +6249,8 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
     BlockDriverState *retval = NULL;
     BlockDriverState *bs_below;
 
+    assert(qemu_in_main_thread());
+
     if (!bs || !bs->drv || !backing_file) {
         return NULL;
     }
@@ -6252,6 +6349,7 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
 
 void bdrv_init(void)
 {
+    assert(qemu_in_main_thread());
 #ifdef CONFIG_BDRV_WHITELIST_TOOLS
     use_bdrv_whitelist = 1;
 #endif
@@ -6260,6 +6358,7 @@ void bdrv_init(void)
 
 void bdrv_init_with_whitelist(void)
 {
+    assert(qemu_in_main_thread());
     use_bdrv_whitelist = 1;
     bdrv_init();
 }
@@ -6344,6 +6443,8 @@ void bdrv_invalidate_cache_all(Error **errp)
     BlockDriverState *bs;
     BdrvNextIterator it;
 
+    assert(qemu_in_main_thread());
+
     for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
         AioContext *aio_context = bdrv_get_aio_context(bs);
         int ret;
@@ -6443,6 +6544,8 @@ int bdrv_inactivate_all(void)
     int ret = 0;
     GSList *aio_ctxs = NULL, *ctx;
 
+    assert(qemu_in_main_thread());
+
     for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
         AioContext *aio_context = bdrv_get_aio_context(bs);
 
@@ -6520,6 +6623,7 @@ void bdrv_eject(BlockDriverState *bs, bool eject_flag)
 void bdrv_lock_medium(BlockDriverState *bs, bool locked)
 {
     BlockDriver *drv = bs->drv;
+    assert(qemu_in_main_thread());
 
     trace_bdrv_lock_medium(bs, locked);
 
@@ -6531,6 +6635,7 @@ void bdrv_lock_medium(BlockDriverState *bs, bool locked)
 /* Get a reference to bs */
 void bdrv_ref(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     bs->refcnt++;
 }
 
@@ -6539,6 +6644,7 @@ void bdrv_ref(BlockDriverState *bs)
  * deleted. */
 void bdrv_unref(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     if (!bs) {
         return;
     }
@@ -6556,6 +6662,7 @@ struct BdrvOpBlocker {
 bool bdrv_op_is_blocked(BlockDriverState *bs, BlockOpType op, Error **errp)
 {
     BdrvOpBlocker *blocker;
+    assert(qemu_in_main_thread());
     assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX);
     if (!QLIST_EMPTY(&bs->op_blockers[op])) {
         blocker = QLIST_FIRST(&bs->op_blockers[op]);
@@ -6570,6 +6677,7 @@ bool bdrv_op_is_blocked(BlockDriverState *bs, BlockOpType op, Error **errp)
 void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason)
 {
     BdrvOpBlocker *blocker;
+    assert(qemu_in_main_thread());
     assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX);
 
     blocker = g_new0(BdrvOpBlocker, 1);
@@ -6580,6 +6688,7 @@ void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason)
 void bdrv_op_unblock(BlockDriverState *bs, BlockOpType op, Error *reason)
 {
     BdrvOpBlocker *blocker, *next;
+    assert(qemu_in_main_thread());
     assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX);
     QLIST_FOREACH_SAFE(blocker, &bs->op_blockers[op], list, next) {
         if (blocker->reason == reason) {
@@ -6592,6 +6701,7 @@ void bdrv_op_unblock(BlockDriverState *bs, BlockOpType op, Error *reason)
 void bdrv_op_block_all(BlockDriverState *bs, Error *reason)
 {
     int i;
+    assert(qemu_in_main_thread());
     for (i = 0; i < BLOCK_OP_TYPE_MAX; i++) {
         bdrv_op_block(bs, i, reason);
     }
@@ -6600,6 +6710,7 @@ void bdrv_op_block_all(BlockDriverState *bs, Error *reason)
 void bdrv_op_unblock_all(BlockDriverState *bs, Error *reason)
 {
     int i;
+    assert(qemu_in_main_thread());
     for (i = 0; i < BLOCK_OP_TYPE_MAX; i++) {
         bdrv_op_unblock(bs, i, reason);
     }
@@ -6608,7 +6719,7 @@ void bdrv_op_unblock_all(BlockDriverState *bs, Error *reason)
 bool bdrv_op_blocker_is_empty(BlockDriverState *bs)
 {
     int i;
-
+    assert(qemu_in_main_thread());
     for (i = 0; i < BLOCK_OP_TYPE_MAX; i++) {
         if (!QLIST_EMPTY(&bs->op_blockers[i])) {
             return false;
@@ -6630,6 +6741,8 @@ void bdrv_img_create(const char *filename, const char *fmt,
     Error *local_err = NULL;
     int ret = 0;
 
+    assert(qemu_in_main_thread());
+
     /* Find driver and parse its options */
     drv = bdrv_find_format(fmt);
     if (!drv) {
@@ -7046,6 +7159,7 @@ static bool bdrv_parent_can_set_aio_context(BdrvChild *c, AioContext *ctx,
 bool bdrv_child_can_set_aio_context(BdrvChild *c, AioContext *ctx,
                                     GSList **ignore, Error **errp)
 {
+    assert(qemu_in_main_thread());
     if (g_slist_find(*ignore, c)) {
         return true;
     }
@@ -7064,6 +7178,8 @@ bool bdrv_can_set_aio_context(BlockDriverState *bs, AioContext *ctx,
         return true;
     }
 
+    assert(qemu_in_main_thread());
+
     QLIST_FOREACH(c, &bs->parents, next_parent) {
         if (!bdrv_parent_can_set_aio_context(c, ctx, ignore, errp)) {
             return false;
@@ -7084,6 +7200,8 @@ int bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
     GSList *ignore;
     bool ret;
 
+    assert(qemu_in_main_thread());
+
     ignore = ignore_child ? g_slist_prepend(NULL, ignore_child) : NULL;
     ret = bdrv_can_set_aio_context(bs, ctx, &ignore, errp);
     g_slist_free(ignore);
@@ -7102,6 +7220,7 @@ int bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
 int bdrv_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
                              Error **errp)
 {
+    assert(qemu_in_main_thread());
     return bdrv_child_try_set_aio_context(bs, ctx, NULL, errp);
 }
 
@@ -7115,6 +7234,7 @@ void bdrv_add_aio_context_notifier(BlockDriverState *bs,
         .detach_aio_context   = detach_aio_context,
         .opaque               = opaque
     };
+    assert(qemu_in_main_thread());
 
     QLIST_INSERT_HEAD(&bs->aio_notifiers, ban, list);
 }
@@ -7126,6 +7246,7 @@ void bdrv_remove_aio_context_notifier(BlockDriverState *bs,
                                       void *opaque)
 {
     BdrvAioNotifier *ban, *ban_next;
+    assert(qemu_in_main_thread());
 
     QLIST_FOREACH_SAFE(ban, &bs->aio_notifiers, list, ban_next) {
         if (ban->attached_aio_context == attached_aio_context &&
@@ -7150,6 +7271,7 @@ int bdrv_amend_options(BlockDriverState *bs, QemuOpts *opts,
                        bool force,
                        Error **errp)
 {
+    assert(qemu_in_main_thread());
     if (!bs->drv) {
         error_setg(errp, "Node is ejected");
         return -ENOMEDIUM;
@@ -7220,6 +7342,8 @@ BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
     BlockDriverState *to_replace_bs = bdrv_find_node(node_name);
     AioContext *aio_context;
 
+    assert(qemu_in_main_thread());
+
     if (!to_replace_bs) {
         error_setg(errp, "Failed to find node with node-name='%s'", node_name);
         return NULL;
@@ -7381,6 +7505,8 @@ void bdrv_refresh_filename(BlockDriverState *bs)
     bool generate_json_filename; /* Whether our default implementation should
                                     fill exact_filename (false) or not (true) */
 
+    assert(qemu_in_main_thread());
+
     if (!drv) {
         return;
     }
@@ -7503,6 +7629,8 @@ char *bdrv_dirname(BlockDriverState *bs, Error **errp)
     BlockDriver *drv = bs->drv;
     BlockDriverState *child_bs;
 
+    assert(qemu_in_main_thread());
+
     if (!drv) {
         error_setg(errp, "Node '%s' is ejected", bs->node_name);
         return NULL;
@@ -7534,7 +7662,7 @@ char *bdrv_dirname(BlockDriverState *bs, Error **errp)
 void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
                     Error **errp)
 {
-
+    assert(qemu_in_main_thread());
     if (!parent_bs->drv || !parent_bs->drv->bdrv_add_child) {
         error_setg(errp, "The node %s does not support adding a child",
                    bdrv_get_device_or_node_name(parent_bs));
@@ -7554,6 +7682,7 @@ void bdrv_del_child(BlockDriverState *parent_bs, BdrvChild *child, Error **errp)
 {
     BdrvChild *tmp;
 
+    assert(qemu_in_main_thread());
     if (!parent_bs->drv || !parent_bs->drv->bdrv_del_child) {
         error_setg(errp, "The node %s does not support removing a child",
                    bdrv_get_device_or_node_name(parent_bs));
@@ -7581,6 +7710,7 @@ int bdrv_make_empty(BdrvChild *c, Error **errp)
     BlockDriver *drv = c->bs->drv;
     int ret;
 
+    assert(qemu_in_main_thread());
     assert(c->perm & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED));
 
     if (!drv->bdrv_make_empty) {
diff --git a/block/commit.c b/block/commit.c
index 10cc5ff451..45a414a19b 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -433,6 +433,8 @@ int bdrv_commit(BlockDriverState *bs)
     QEMU_AUTO_VFREE uint8_t *buf = NULL;
     Error *local_err = NULL;
 
+    assert(qemu_in_main_thread());
+
     if (!drv)
         return -ENOMEDIUM;
 
diff --git a/block/io.c b/block/io.c
index bb0a254def..c5d7f8495e 100644
--- a/block/io.c
+++ b/block/io.c
@@ -164,6 +164,8 @@ void bdrv_refresh_limits(BlockDriverState *bs, Transaction *tran, Error **errp)
     BdrvChild *c;
     bool have_limits;
 
+    assert(qemu_in_main_thread());
+
     if (tran) {
         BdrvRefreshLimitsState *s = g_new(BdrvRefreshLimitsState, 1);
         *s = (BdrvRefreshLimitsState) {
@@ -544,6 +546,7 @@ void bdrv_drained_end(BlockDriverState *bs)
 
 void bdrv_drained_end_no_poll(BlockDriverState *bs, int *drained_end_counter)
 {
+    assert(qemu_in_main_thread());
     bdrv_do_drained_end(bs, false, NULL, false, drained_end_counter);
 }
 
@@ -586,12 +589,14 @@ void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent)
 void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
 {
     assert(qemu_in_coroutine());
+    assert(qemu_in_main_thread());
     bdrv_drained_begin(bs);
     bdrv_drained_end(bs);
 }
 
 void bdrv_drain(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     bdrv_drained_begin(bs);
     bdrv_drained_end(bs);
 }
@@ -612,6 +617,7 @@ static bool bdrv_drain_all_poll(void)
 {
     BlockDriverState *bs = NULL;
     bool result = false;
+    assert(qemu_in_main_thread());
 
     /* bdrv_drain_poll() can't make changes to the graph and we are holding the
      * main AioContext lock, so iterating bdrv_next_all_states() is safe. */
@@ -640,6 +646,7 @@ static bool bdrv_drain_all_poll(void)
 void bdrv_drain_all_begin(void)
 {
     BlockDriverState *bs = NULL;
+    assert(qemu_in_main_thread());
 
     if (qemu_in_coroutine()) {
         bdrv_co_yield_to_drain(NULL, true, false, NULL, true, true, NULL);
@@ -696,6 +703,7 @@ void bdrv_drain_all_end(void)
 {
     BlockDriverState *bs = NULL;
     int drained_end_counter = 0;
+    assert(qemu_in_main_thread());
 
     /*
      * bdrv queue is managed by record/replay,
@@ -723,6 +731,7 @@ void bdrv_drain_all_end(void)
 
 void bdrv_drain_all(void)
 {
+    assert(qemu_in_main_thread());
     bdrv_drain_all_begin();
     bdrv_drain_all_end();
 }
@@ -2345,6 +2354,8 @@ int bdrv_flush_all(void)
     BlockDriverState *bs = NULL;
     int result = 0;
 
+    assert(qemu_in_main_thread());
+
     /*
      * bdrv queue is managed by record/replay,
      * creating new flush request for stopping
@@ -2731,6 +2742,7 @@ int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
 int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
                       int64_t *pnum, int64_t *map, BlockDriverState **file)
 {
+    assert(qemu_in_main_thread());
     return bdrv_block_status_above(bs, bdrv_filter_or_cow_bs(bs),
                                    offset, bytes, pnum, map, file);
 }
@@ -2800,6 +2812,7 @@ int bdrv_is_allocated_above(BlockDriverState *top,
                             int64_t bytes, int64_t *pnum)
 {
     int depth;
+
     int ret = bdrv_common_block_status_above(top, base, include_base, false,
                                              offset, bytes, pnum, NULL, NULL,
                                              &depth);
@@ -2878,6 +2891,7 @@ bdrv_co_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos)
 int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
                       int64_t pos, int size)
 {
+    assert(qemu_in_main_thread());
     QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, size);
     int ret = bdrv_writev_vmstate(bs, &qiov, pos);
 
@@ -2887,6 +2901,7 @@ int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
 int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
                       int64_t pos, int size)
 {
+    assert(qemu_in_main_thread());
     QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, size);
     int ret = bdrv_readv_vmstate(bs, &qiov, pos);
 
@@ -2898,6 +2913,7 @@ int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
 
 void bdrv_aio_cancel(BlockAIOCB *acb)
 {
+    assert(qemu_in_main_thread());
     qemu_aio_ref(acb);
     bdrv_aio_cancel_async(acb);
     while (acb->refcnt > 1) {
@@ -2922,6 +2938,7 @@ void bdrv_aio_cancel(BlockAIOCB *acb)
  * In either case the completion callback must be called. */
 void bdrv_aio_cancel_async(BlockAIOCB *acb)
 {
+    assert(qemu_in_main_thread());
     if (acb->aiocb_info->cancel_async) {
         acb->aiocb_info->cancel_async(acb);
     }
@@ -3292,6 +3309,7 @@ void bdrv_register_buf(BlockDriverState *bs, void *host, size_t size)
 {
     BdrvChild *child;
 
+    assert(qemu_in_main_thread());
     if (bs->drv && bs->drv->bdrv_register_buf) {
         bs->drv->bdrv_register_buf(bs, host, size);
     }
@@ -3304,6 +3322,7 @@ void bdrv_unregister_buf(BlockDriverState *bs, void *host)
 {
     BdrvChild *child;
 
+    assert(qemu_in_main_thread());
     if (bs->drv && bs->drv->bdrv_unregister_buf) {
         bs->drv->bdrv_unregister_buf(bs, host);
     }
@@ -3575,6 +3594,7 @@ out:
 
 void bdrv_cancel_in_flight(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     if (!bs || !bs->drv) {
         return;
     }
diff --git a/blockdev.c b/blockdev.c
index b35072644e..ae322ed10e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -690,6 +690,7 @@ void blockdev_close_all_bdrv_states(void)
 /* Iterates over the list of monitor-owned BlockDriverStates */
 BlockDriverState *bdrv_next_monitor_owned(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     return bs ? QTAILQ_NEXT(bs, monitor_list)
               : QTAILQ_FIRST(&monitor_bdrv_states);
 }
-- 
2.27.0




* [PATCH v4 04/25] include/sysemu/block-backend: split header into I/O and global state (GS) API
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (2 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 03/25] assertions for block " Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-12 10:23   ` Hanna Reitz
  2021-11-12 12:30   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 05/25] block/block-backend.c: assertions for block-backend Emanuele Giuseppe Esposito
                   ` (23 subsequent siblings)
  27 siblings, 2 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Similarly to the previous patches, split block-backend.h
into block-backend-io.h and block-backend-global-state.h.

In addition, remove the "block/block.h" and "qemu/iov.h" includes,
as they no longer seem to be necessary.

block-backend-common.h contains the structures shared between
the two headers, and the functions that can't be categorized as
I/O or global state.
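
As an illustration only (not part of this patch), the three-way layout
described above can be modelled in a single C file. Every sketch_* and
Sketch* name below is hypothetical, and the "headers" are just
include-guarded sections standing in for the real files:

```c
#include <assert.h>

/* ---- stand-in for a "-common.h" header: shared types only ---- */
#ifndef SKETCH_BACKEND_COMMON_H
#define SKETCH_BACKEND_COMMON_H
typedef struct SketchBackend {
    int refcnt;     /* only touched by global state functions */
    int in_flight;  /* touched by I/O functions (plain int for brevity) */
} SketchBackend;
#endif /* SKETCH_BACKEND_COMMON_H */

/* ---- stand-in for a "-global-state.h" header ---- */
#ifndef SKETCH_BACKEND_GS_H
#define SKETCH_BACKEND_GS_H
/* A real header would #include the common header at this point. */
void sketch_ref(SketchBackend *b);
void sketch_unref(SketchBackend *b);
#endif /* SKETCH_BACKEND_GS_H */

/* ---- stand-in for an "-io.h" header ---- */
#ifndef SKETCH_BACKEND_IO_H
#define SKETCH_BACKEND_IO_H
/* A real header would #include the common header at this point. */
void sketch_inc_in_flight(SketchBackend *b);
void sketch_dec_in_flight(SketchBackend *b);
#endif /* SKETCH_BACKEND_IO_H */

/* ---- stand-in for the .c file, which sees both sub-headers ---- */
void sketch_ref(SketchBackend *b)
{
    b->refcnt++;
}

void sketch_unref(SketchBackend *b)
{
    assert(b->refcnt > 0);
    b->refcnt--;
}

void sketch_inc_in_flight(SketchBackend *b)
{
    b->in_flight++;
}

void sketch_dec_in_flight(SketchBackend *b)
{
    assert(b->in_flight > 0);
    b->in_flight--;
}
```

Existing callers keep including the legacy header and see the same set
of declarations as before. Only the grouping changes, which makes it
visible which category each function belongs to.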

Assertions are added in the next patch.
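
For readers who have not seen the earlier patches, here is a rough,
self-contained sketch of the kind of assertion meant here. It is not
the code from this series: sketch_bql_held and all sketch_* helpers are
made up for illustration, and the real qemu_in_main_thread() helper
(introduced in patch 01) may well be implemented differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of "the caller currently holds the BQL". */
static bool sketch_bql_held;

static void sketch_bql_lock(void)
{
    sketch_bql_held = true;
}

static void sketch_bql_unlock(void)
{
    sketch_bql_held = false;
}

/* Stand-in for qemu_in_main_thread(); an assumption, not the real helper. */
static bool sketch_in_main_thread(void)
{
    return sketch_bql_held;
}

typedef struct SketchBackend {
    const char *name;
    int in_flight;
} SketchBackend;

/* Global state (GS) function: asserts that the caller holds the BQL. */
static const char *sketch_name(const SketchBackend *b)
{
    assert(sketch_in_main_thread());
    return b->name;
}

/* I/O function: may be called from any thread, so no such assertion. */
static void sketch_inc_in_flight(SketchBackend *b)
{
    b->in_flight++;
}
```

Calling sketch_name() without first taking the (modelled) lock aborts
in a debug build, which is how a miscategorized caller would be caught.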

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/block-backend.c                       |   9 +-
 include/sysemu/block-backend-common.h       |  74 ++++++
 include/sysemu/block-backend-global-state.h | 122 +++++++++
 include/sysemu/block-backend-io.h           | 139 ++++++++++
 include/sysemu/block-backend.h              | 269 +-------------------
 5 files changed, 344 insertions(+), 269 deletions(-)
 create mode 100644 include/sysemu/block-backend-common.h
 create mode 100644 include/sysemu/block-backend-global-state.h
 create mode 100644 include/sysemu/block-backend-io.h

diff --git a/block/block-backend.c b/block/block-backend.c
index 39cd99df2b..0afc03fd66 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -79,6 +79,7 @@ struct BlockBackend {
     bool allow_aio_context_change;
     bool allow_write_beyond_eof;
 
+    /* Protected by BQL */
     NotifierList remove_bs_notifiers, insert_bs_notifiers;
     QLIST_HEAD(, BlockBackendAioNotifier) aio_notifiers;
 
@@ -111,12 +112,14 @@ static const AIOCBInfo block_backend_aiocb_info = {
 static void drive_info_del(DriveInfo *dinfo);
 static BlockBackend *bdrv_first_blk(BlockDriverState *bs);
 
-/* All BlockBackends */
+/* All BlockBackends. Protected by BQL. */
 static QTAILQ_HEAD(, BlockBackend) block_backends =
     QTAILQ_HEAD_INITIALIZER(block_backends);
 
-/* All BlockBackends referenced by the monitor and which are iterated through by
- * blk_next() */
+/*
+ * All BlockBackends referenced by the monitor and which are iterated through by
+ * blk_next(). Protected by BQL.
+ */
 static QTAILQ_HEAD(, BlockBackend) monitor_block_backends =
     QTAILQ_HEAD_INITIALIZER(monitor_block_backends);
 
diff --git a/include/sysemu/block-backend-common.h b/include/sysemu/block-backend-common.h
new file mode 100644
index 0000000000..52ff6a4d26
--- /dev/null
+++ b/include/sysemu/block-backend-common.h
@@ -0,0 +1,74 @@
+/*
+ * QEMU Block backends
+ *
+ * Copyright (C) 2014-2016 Red Hat, Inc.
+ *
+ * Authors:
+ *  Markus Armbruster <armbru@redhat.com>,
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1
+ * or later.  See the COPYING.LIB file in the top-level directory.
+ */
+
+#ifndef BLOCK_BACKEND_COMMON_H
+#define BLOCK_BACKEND_COMMON_H
+
+#include "block/throttle-groups.h"
+
+/* Callbacks for block device models */
+typedef struct BlockDevOps {
+    /*
+     * Runs when virtual media changed (monitor commands eject, change)
+     * Argument load is true on load and false on eject.
+     * Beware: doesn't run when a host device's physical media
+     * changes.  Sure would be useful if it did.
+     * Device models with removable media must implement this callback.
+     */
+    void (*change_media_cb)(void *opaque, bool load, Error **errp);
+    /*
+     * Runs when an eject request is issued from the monitor, the tray
+     * is closed, and the medium is locked.
+     * Device models that do not implement is_medium_locked will not need
+     * this callback.  Device models that can lock the medium or tray might
+     * want to implement the callback and unlock the tray when "force" is
+     * true, even if they do not support eject requests.
+     */
+    void (*eject_request_cb)(void *opaque, bool force);
+    /*
+     * Is the virtual tray open?
+     * Device models implement this only when the device has a tray.
+     */
+    bool (*is_tray_open)(void *opaque);
+    /*
+     * Is the virtual medium locked into the device?
+     * Device models implement this only when device has such a lock.
+     */
+    bool (*is_medium_locked)(void *opaque);
+    /*
+     * Runs when the size changed (e.g. monitor command block_resize)
+     */
+    void (*resize_cb)(void *opaque);
+    /*
+     * Runs when the backend receives a drain request.
+     */
+    void (*drained_begin)(void *opaque);
+    /*
+     * Runs when the backend's last drain request ends.
+     */
+    void (*drained_end)(void *opaque);
+    /*
+     * Is the device still busy?
+     */
+    bool (*drained_poll)(void *opaque);
+} BlockDevOps;
+
+/*
+ * This struct is embedded in (the private) BlockBackend struct and contains
+ * fields that must be public. This is in particular for QLIST_ENTRY() and
+ * friends so that BlockBackends can be kept in lists outside block-backend.c
+ */
+typedef struct BlockBackendPublic {
+    ThrottleGroupMember throttle_group_member;
+} BlockBackendPublic;
+
+#endif /* BLOCK_BACKEND_COMMON_H */
diff --git a/include/sysemu/block-backend-global-state.h b/include/sysemu/block-backend-global-state.h
new file mode 100644
index 0000000000..4001b1c02a
--- /dev/null
+++ b/include/sysemu/block-backend-global-state.h
@@ -0,0 +1,122 @@
+/*
+ * QEMU Block backends
+ *
+ * Copyright (C) 2014-2016 Red Hat, Inc.
+ *
+ * Authors:
+ *  Markus Armbruster <armbru@redhat.com>,
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1
+ * or later.  See the COPYING.LIB file in the top-level directory.
+ */
+
+#ifndef BLOCK_BACKEND_GS_H
+#define BLOCK_BACKEND_GS_H
+
+#include "block-backend-common.h"
+
+/*
+ * Global state (GS) API. These functions run under the BQL.
+ *
+ * See include/block/block-global-state.h for more information about
+ * the GS API.
+ */
+
+BlockBackend *blk_new(AioContext *ctx, uint64_t perm, uint64_t shared_perm);
+BlockBackend *blk_new_with_bs(BlockDriverState *bs, uint64_t perm,
+                              uint64_t shared_perm, Error **errp);
+BlockBackend *blk_new_open(const char *filename, const char *reference,
+                           QDict *options, int flags, Error **errp);
+int blk_get_refcnt(BlockBackend *blk);
+void blk_ref(BlockBackend *blk);
+void blk_unref(BlockBackend *blk);
+void blk_remove_all_bs(void);
+const char *blk_name(const BlockBackend *blk);
+BlockBackend *blk_by_name(const char *name);
+BlockBackend *blk_next(BlockBackend *blk);
+BlockBackend *blk_all_next(BlockBackend *blk);
+bool monitor_add_blk(BlockBackend *blk, const char *name, Error **errp);
+void monitor_remove_blk(BlockBackend *blk);
+
+BlockBackendPublic *blk_get_public(BlockBackend *blk);
+BlockBackend *blk_by_public(BlockBackendPublic *public);
+
+void blk_remove_bs(BlockBackend *blk);
+int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp);
+bool bdrv_has_blk(BlockDriverState *bs);
+bool bdrv_is_root_node(BlockDriverState *bs);
+int blk_set_perm(BlockBackend *blk, uint64_t perm, uint64_t shared_perm,
+                 Error **errp);
+void blk_get_perm(BlockBackend *blk, uint64_t *perm, uint64_t *shared_perm);
+
+void blk_iostatus_enable(BlockBackend *blk);
+bool blk_iostatus_is_enabled(const BlockBackend *blk);
+BlockDeviceIoStatus blk_iostatus(const BlockBackend *blk);
+void blk_iostatus_disable(BlockBackend *blk);
+void blk_iostatus_reset(BlockBackend *blk);
+void blk_iostatus_set_err(BlockBackend *blk, int error);
+int blk_attach_dev(BlockBackend *blk, DeviceState *dev);
+void blk_detach_dev(BlockBackend *blk, DeviceState *dev);
+DeviceState *blk_get_attached_dev(BlockBackend *blk);
+char *blk_get_attached_dev_id(BlockBackend *blk);
+BlockBackend *blk_by_dev(void *dev);
+BlockBackend *blk_by_qdev_id(const char *id, Error **errp);
+void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops, void *opaque);
+
+int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags);
+int64_t blk_nb_sectors(BlockBackend *blk);
+int blk_commit_all(void);
+void blk_drain(BlockBackend *blk);
+void blk_drain_all(void);
+void blk_set_on_error(BlockBackend *blk, BlockdevOnError on_read_error,
+                      BlockdevOnError on_write_error);
+bool blk_supports_write_perm(BlockBackend *blk);
+bool blk_is_sg(BlockBackend *blk);
+bool blk_enable_write_cache(BlockBackend *blk);
+void blk_set_enable_write_cache(BlockBackend *blk, bool wce);
+void blk_lock_medium(BlockBackend *blk, bool locked);
+void blk_eject(BlockBackend *blk, bool eject_flag);
+int blk_get_flags(BlockBackend *blk);
+void blk_set_guest_block_size(BlockBackend *blk, int align);
+bool blk_op_is_blocked(BlockBackend *blk, BlockOpType op, Error **errp);
+void blk_op_unblock(BlockBackend *blk, BlockOpType op, Error *reason);
+void blk_op_block_all(BlockBackend *blk, Error *reason);
+void blk_op_unblock_all(BlockBackend *blk, Error *reason);
+int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
+                        Error **errp);
+void blk_add_aio_context_notifier(BlockBackend *blk,
+        void (*attached_aio_context)(AioContext *new_context, void *opaque),
+        void (*detach_aio_context)(void *opaque), void *opaque);
+void blk_remove_aio_context_notifier(BlockBackend *blk,
+                                     void (*attached_aio_context)(AioContext *,
+                                                                  void *),
+                                     void (*detach_aio_context)(void *),
+                                     void *opaque);
+void blk_add_remove_bs_notifier(BlockBackend *blk, Notifier *notify);
+void blk_add_insert_bs_notifier(BlockBackend *blk, Notifier *notify);
+BlockBackendRootState *blk_get_root_state(BlockBackend *blk);
+void blk_update_root_state(BlockBackend *blk);
+bool blk_get_detect_zeroes_from_root_state(BlockBackend *blk);
+int blk_get_open_flags_from_root_state(BlockBackend *blk);
+
+int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
+                     int64_t pos, int size);
+int blk_load_vmstate(BlockBackend *blk, uint8_t *buf, int64_t pos, int size);
+int blk_probe_blocksizes(BlockBackend *blk, BlockSizes *bsz);
+int blk_probe_geometry(BlockBackend *blk, HDGeometry *geo);
+BlockAIOCB *blk_abort_aio_request(BlockBackend *blk,
+                                  BlockCompletionFunc *cb,
+                                  void *opaque, int ret);
+
+void blk_set_io_limits(BlockBackend *blk, ThrottleConfig *cfg);
+void blk_io_limits_disable(BlockBackend *blk);
+void blk_io_limits_enable(BlockBackend *blk, const char *group);
+void blk_io_limits_update_group(BlockBackend *blk, const char *group);
+void blk_set_force_allow_inactivate(BlockBackend *blk);
+
+void blk_register_buf(BlockBackend *blk, void *host, size_t size);
+void blk_unregister_buf(BlockBackend *blk, void *host);
+
+const BdrvChild *blk_root(BlockBackend *blk);
+
+#endif /* BLOCK_BACKEND_GS_H */
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
new file mode 100644
index 0000000000..ab0463cb69
--- /dev/null
+++ b/include/sysemu/block-backend-io.h
@@ -0,0 +1,139 @@
+/*
+ * QEMU Block backends
+ *
+ * Copyright (C) 2014-2016 Red Hat, Inc.
+ *
+ * Authors:
+ *  Markus Armbruster <armbru@redhat.com>,
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1
+ * or later.  See the COPYING.LIB file in the top-level directory.
+ */
+
+#ifndef BLOCK_BACKEND_IO_H
+#define BLOCK_BACKEND_IO_H
+
+#include "block-backend-common.h"
+
+/*
+ * I/O API functions. These functions are thread-safe.
+ *
+ * See include/block/block-io.h for more information about
+ * the I/O API.
+ */
+
+BlockDriverState *blk_bs(BlockBackend *blk);
+
+int blk_replace_bs(BlockBackend *blk, BlockDriverState *new_bs, Error **errp);
+
+void blk_set_allow_write_beyond_eof(BlockBackend *blk, bool allow);
+void blk_set_allow_aio_context_change(BlockBackend *blk, bool allow);
+void blk_set_disable_request_queuing(BlockBackend *blk, bool disable);
+
+int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int bytes);
+int blk_pwrite(BlockBackend *blk, int64_t offset, const void *buf, int bytes,
+               BdrvRequestFlags flags);
+int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
+                               int64_t bytes, QEMUIOVector *qiov,
+                               BdrvRequestFlags flags);
+int coroutine_fn blk_co_pwritev_part(BlockBackend *blk, int64_t offset,
+                                     int64_t bytes,
+                                     QEMUIOVector *qiov, size_t qiov_offset,
+                                     BdrvRequestFlags flags);
+int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
+                                int64_t bytes, QEMUIOVector *qiov,
+                                BdrvRequestFlags flags);
+
+static inline int coroutine_fn blk_co_pread(BlockBackend *blk, int64_t offset,
+                                            int64_t bytes, void *buf,
+                                            BdrvRequestFlags flags)
+{
+    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
+
+    assert(bytes <= SIZE_MAX);
+
+    return blk_co_preadv(blk, offset, bytes, &qiov, flags);
+}
+
+static inline int coroutine_fn blk_co_pwrite(BlockBackend *blk, int64_t offset,
+                                             int64_t bytes, void *buf,
+                                             BdrvRequestFlags flags)
+{
+    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
+
+    assert(bytes <= SIZE_MAX);
+
+    return blk_co_pwritev(blk, offset, bytes, &qiov, flags);
+}
+
+BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
+                                  int64_t bytes, BdrvRequestFlags flags,
+                                  BlockCompletionFunc *cb, void *opaque);
+
+BlockAIOCB *blk_aio_preadv(BlockBackend *blk, int64_t offset,
+                           QEMUIOVector *qiov, BdrvRequestFlags flags,
+                           BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
+                            QEMUIOVector *qiov, BdrvRequestFlags flags,
+                            BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_flush(BlockBackend *blk,
+                          BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
+                             BlockCompletionFunc *cb, void *opaque);
+void blk_aio_cancel(BlockAIOCB *acb);
+void blk_aio_cancel_async(BlockAIOCB *acb);
+int blk_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
+BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
+                          BlockCompletionFunc *cb, void *opaque);
+int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
+                                 int64_t bytes);
+int coroutine_fn blk_co_flush(BlockBackend *blk);
+int blk_flush(BlockBackend *blk);
+void blk_inc_in_flight(BlockBackend *blk);
+void blk_dec_in_flight(BlockBackend *blk);
+bool blk_is_inserted(BlockBackend *blk);
+bool blk_is_available(BlockBackend *blk);
+int64_t blk_getlength(BlockBackend *blk);
+void blk_get_geometry(BlockBackend *blk, uint64_t *nb_sectors_ptr);
+void *blk_try_blockalign(BlockBackend *blk, size_t size);
+void *blk_blockalign(BlockBackend *blk, size_t size);
+bool blk_is_writable(BlockBackend *blk);
+BlockdevOnError blk_get_on_error(BlockBackend *blk, bool is_read);
+BlockErrorAction blk_get_error_action(BlockBackend *blk, bool is_read,
+                                      int error);
+void blk_error_action(BlockBackend *blk, BlockErrorAction action,
+                      bool is_read, int error);
+int blk_get_max_iov(BlockBackend *blk);
+int blk_get_max_hw_iov(BlockBackend *blk);
+
+void blk_invalidate_cache(BlockBackend *blk, Error **errp);
+
+void blk_io_plug(BlockBackend *blk);
+void blk_io_unplug(BlockBackend *blk);
+AioContext *blk_get_aio_context(BlockBackend *blk);
+BlockAcctStats *blk_get_stats(BlockBackend *blk);
+void *blk_aio_get(const AIOCBInfo *aiocb_info, BlockBackend *blk,
+                  BlockCompletionFunc *cb, void *opaque);
+int blk_pwrite_compressed(BlockBackend *blk, int64_t offset, const void *buf,
+                          int64_t bytes);
+int blk_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
+int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
+                      int64_t bytes, BdrvRequestFlags flags);
+int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
+                                      int64_t bytes, BdrvRequestFlags flags);
+int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
+                 PreallocMode prealloc, BdrvRequestFlags flags, Error **errp);
+
+uint32_t blk_get_request_alignment(BlockBackend *blk);
+uint32_t blk_get_max_transfer(BlockBackend *blk);
+uint64_t blk_get_max_hw_transfer(BlockBackend *blk);
+
+int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,
+                                   BlockBackend *blk_out, int64_t off_out,
+                                   int64_t bytes, BdrvRequestFlags read_flags,
+                                   BdrvRequestFlags write_flags);
+
+
+int blk_make_empty(BlockBackend *blk, Error **errp);
+
+#endif /* BLOCK_BACKEND_IO_H */
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index e5e1524f06..038be9fc40 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -13,272 +13,9 @@
 #ifndef BLOCK_BACKEND_H
 #define BLOCK_BACKEND_H
 
-#include "qemu/iov.h"
-#include "block/throttle-groups.h"
+#include "block-backend-global-state.h"
+#include "block-backend-io.h"
 
-/*
- * TODO Have to include block/block.h for a bunch of block layer
- * types.  Unfortunately, this pulls in the whole BlockDriverState
- * API, which we don't want used by many BlockBackend users.  Some of
- * the types belong here, and the rest should be split into a common
- * header and one for the BlockDriverState API.
- */
-#include "block/block.h"
-
-/* Callbacks for block device models */
-typedef struct BlockDevOps {
-    /*
-     * Runs when virtual media changed (monitor commands eject, change)
-     * Argument load is true on load and false on eject.
-     * Beware: doesn't run when a host device's physical media
-     * changes.  Sure would be useful if it did.
-     * Device models with removable media must implement this callback.
-     */
-    void (*change_media_cb)(void *opaque, bool load, Error **errp);
-    /*
-     * Runs when an eject request is issued from the monitor, the tray
-     * is closed, and the medium is locked.
-     * Device models that do not implement is_medium_locked will not need
-     * this callback.  Device models that can lock the medium or tray might
-     * want to implement the callback and unlock the tray when "force" is
-     * true, even if they do not support eject requests.
-     */
-    void (*eject_request_cb)(void *opaque, bool force);
-    /*
-     * Is the virtual tray open?
-     * Device models implement this only when the device has a tray.
-     */
-    bool (*is_tray_open)(void *opaque);
-    /*
-     * Is the virtual medium locked into the device?
-     * Device models implement this only when device has such a lock.
-     */
-    bool (*is_medium_locked)(void *opaque);
-    /*
-     * Runs when the size changed (e.g. monitor command block_resize)
-     */
-    void (*resize_cb)(void *opaque);
-    /*
-     * Runs when the backend receives a drain request.
-     */
-    void (*drained_begin)(void *opaque);
-    /*
-     * Runs when the backend's last drain request ends.
-     */
-    void (*drained_end)(void *opaque);
-    /*
-     * Is the device still busy?
-     */
-    bool (*drained_poll)(void *opaque);
-} BlockDevOps;
-
-/* This struct is embedded in (the private) BlockBackend struct and contains
- * fields that must be public. This is in particular for QLIST_ENTRY() and
- * friends so that BlockBackends can be kept in lists outside block-backend.c
- * */
-typedef struct BlockBackendPublic {
-    ThrottleGroupMember throttle_group_member;
-} BlockBackendPublic;
-
-BlockBackend *blk_new(AioContext *ctx, uint64_t perm, uint64_t shared_perm);
-BlockBackend *blk_new_with_bs(BlockDriverState *bs, uint64_t perm,
-                              uint64_t shared_perm, Error **errp);
-BlockBackend *blk_new_open(const char *filename, const char *reference,
-                           QDict *options, int flags, Error **errp);
-int blk_get_refcnt(BlockBackend *blk);
-void blk_ref(BlockBackend *blk);
-void blk_unref(BlockBackend *blk);
-void blk_remove_all_bs(void);
-const char *blk_name(const BlockBackend *blk);
-BlockBackend *blk_by_name(const char *name);
-BlockBackend *blk_next(BlockBackend *blk);
-BlockBackend *blk_all_next(BlockBackend *blk);
-bool monitor_add_blk(BlockBackend *blk, const char *name, Error **errp);
-void monitor_remove_blk(BlockBackend *blk);
-
-BlockBackendPublic *blk_get_public(BlockBackend *blk);
-BlockBackend *blk_by_public(BlockBackendPublic *public);
-
-BlockDriverState *blk_bs(BlockBackend *blk);
-void blk_remove_bs(BlockBackend *blk);
-int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp);
-int blk_replace_bs(BlockBackend *blk, BlockDriverState *new_bs, Error **errp);
-bool bdrv_has_blk(BlockDriverState *bs);
-bool bdrv_is_root_node(BlockDriverState *bs);
-int blk_set_perm(BlockBackend *blk, uint64_t perm, uint64_t shared_perm,
-                 Error **errp);
-void blk_get_perm(BlockBackend *blk, uint64_t *perm, uint64_t *shared_perm);
-
-void blk_set_allow_write_beyond_eof(BlockBackend *blk, bool allow);
-void blk_set_allow_aio_context_change(BlockBackend *blk, bool allow);
-void blk_set_disable_request_queuing(BlockBackend *blk, bool disable);
-void blk_iostatus_enable(BlockBackend *blk);
-bool blk_iostatus_is_enabled(const BlockBackend *blk);
-BlockDeviceIoStatus blk_iostatus(const BlockBackend *blk);
-void blk_iostatus_disable(BlockBackend *blk);
-void blk_iostatus_reset(BlockBackend *blk);
-void blk_iostatus_set_err(BlockBackend *blk, int error);
-int blk_attach_dev(BlockBackend *blk, DeviceState *dev);
-void blk_detach_dev(BlockBackend *blk, DeviceState *dev);
-DeviceState *blk_get_attached_dev(BlockBackend *blk);
-char *blk_get_attached_dev_id(BlockBackend *blk);
-BlockBackend *blk_by_dev(void *dev);
-BlockBackend *blk_by_qdev_id(const char *id, Error **errp);
-void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops, void *opaque);
-int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
-                               int64_t bytes, QEMUIOVector *qiov,
-                               BdrvRequestFlags flags);
-int coroutine_fn blk_co_pwritev_part(BlockBackend *blk, int64_t offset,
-                                     int64_t bytes,
-                                     QEMUIOVector *qiov, size_t qiov_offset,
-                                     BdrvRequestFlags flags);
-int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
-                               int64_t bytes, QEMUIOVector *qiov,
-                               BdrvRequestFlags flags);
-
-static inline int coroutine_fn blk_co_pread(BlockBackend *blk, int64_t offset,
-                                            int64_t bytes, void *buf,
-                                            BdrvRequestFlags flags)
-{
-    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
-
-    assert(bytes <= SIZE_MAX);
-
-    return blk_co_preadv(blk, offset, bytes, &qiov, flags);
-}
-
-static inline int coroutine_fn blk_co_pwrite(BlockBackend *blk, int64_t offset,
-                                             int64_t bytes, void *buf,
-                                             BdrvRequestFlags flags)
-{
-    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
-
-    assert(bytes <= SIZE_MAX);
-
-    return blk_co_pwritev(blk, offset, bytes, &qiov, flags);
-}
-
-int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
-                      int64_t bytes, BdrvRequestFlags flags);
-BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
-                                  int64_t bytes, BdrvRequestFlags flags,
-                                  BlockCompletionFunc *cb, void *opaque);
-int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags);
-int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int bytes);
-int blk_pwrite(BlockBackend *blk, int64_t offset, const void *buf, int bytes,
-               BdrvRequestFlags flags);
-int64_t blk_getlength(BlockBackend *blk);
-void blk_get_geometry(BlockBackend *blk, uint64_t *nb_sectors_ptr);
-int64_t blk_nb_sectors(BlockBackend *blk);
-BlockAIOCB *blk_aio_preadv(BlockBackend *blk, int64_t offset,
-                           QEMUIOVector *qiov, BdrvRequestFlags flags,
-                           BlockCompletionFunc *cb, void *opaque);
-BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
-                            QEMUIOVector *qiov, BdrvRequestFlags flags,
-                            BlockCompletionFunc *cb, void *opaque);
-BlockAIOCB *blk_aio_flush(BlockBackend *blk,
-                          BlockCompletionFunc *cb, void *opaque);
-BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
-                             BlockCompletionFunc *cb, void *opaque);
-void blk_aio_cancel(BlockAIOCB *acb);
-void blk_aio_cancel_async(BlockAIOCB *acb);
-int blk_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
-BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
-                          BlockCompletionFunc *cb, void *opaque);
-int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
-                                 int64_t bytes);
-int coroutine_fn blk_co_flush(BlockBackend *blk);
-int blk_flush(BlockBackend *blk);
-int blk_commit_all(void);
-void blk_inc_in_flight(BlockBackend *blk);
-void blk_dec_in_flight(BlockBackend *blk);
-void blk_drain(BlockBackend *blk);
-void blk_drain_all(void);
-void blk_set_on_error(BlockBackend *blk, BlockdevOnError on_read_error,
-                      BlockdevOnError on_write_error);
-BlockdevOnError blk_get_on_error(BlockBackend *blk, bool is_read);
-BlockErrorAction blk_get_error_action(BlockBackend *blk, bool is_read,
-                                      int error);
-void blk_error_action(BlockBackend *blk, BlockErrorAction action,
-                      bool is_read, int error);
-bool blk_supports_write_perm(BlockBackend *blk);
-bool blk_is_writable(BlockBackend *blk);
-bool blk_is_sg(BlockBackend *blk);
-bool blk_enable_write_cache(BlockBackend *blk);
-void blk_set_enable_write_cache(BlockBackend *blk, bool wce);
-void blk_invalidate_cache(BlockBackend *blk, Error **errp);
-bool blk_is_inserted(BlockBackend *blk);
-bool blk_is_available(BlockBackend *blk);
-void blk_lock_medium(BlockBackend *blk, bool locked);
-void blk_eject(BlockBackend *blk, bool eject_flag);
-int blk_get_flags(BlockBackend *blk);
-uint32_t blk_get_request_alignment(BlockBackend *blk);
-uint32_t blk_get_max_transfer(BlockBackend *blk);
-uint64_t blk_get_max_hw_transfer(BlockBackend *blk);
-int blk_get_max_iov(BlockBackend *blk);
-int blk_get_max_hw_iov(BlockBackend *blk);
-void blk_set_guest_block_size(BlockBackend *blk, int align);
-void *blk_try_blockalign(BlockBackend *blk, size_t size);
-void *blk_blockalign(BlockBackend *blk, size_t size);
-bool blk_op_is_blocked(BlockBackend *blk, BlockOpType op, Error **errp);
-void blk_op_unblock(BlockBackend *blk, BlockOpType op, Error *reason);
-void blk_op_block_all(BlockBackend *blk, Error *reason);
-void blk_op_unblock_all(BlockBackend *blk, Error *reason);
-AioContext *blk_get_aio_context(BlockBackend *blk);
-int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
-                        Error **errp);
-void blk_add_aio_context_notifier(BlockBackend *blk,
-        void (*attached_aio_context)(AioContext *new_context, void *opaque),
-        void (*detach_aio_context)(void *opaque), void *opaque);
-void blk_remove_aio_context_notifier(BlockBackend *blk,
-                                     void (*attached_aio_context)(AioContext *,
-                                                                  void *),
-                                     void (*detach_aio_context)(void *),
-                                     void *opaque);
-void blk_add_remove_bs_notifier(BlockBackend *blk, Notifier *notify);
-void blk_add_insert_bs_notifier(BlockBackend *blk, Notifier *notify);
-void blk_io_plug(BlockBackend *blk);
-void blk_io_unplug(BlockBackend *blk);
-BlockAcctStats *blk_get_stats(BlockBackend *blk);
-BlockBackendRootState *blk_get_root_state(BlockBackend *blk);
-void blk_update_root_state(BlockBackend *blk);
-bool blk_get_detect_zeroes_from_root_state(BlockBackend *blk);
-int blk_get_open_flags_from_root_state(BlockBackend *blk);
-
-void *blk_aio_get(const AIOCBInfo *aiocb_info, BlockBackend *blk,
-                  BlockCompletionFunc *cb, void *opaque);
-int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
-                                      int64_t bytes, BdrvRequestFlags flags);
-int blk_pwrite_compressed(BlockBackend *blk, int64_t offset, const void *buf,
-                          int64_t bytes);
-int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
-                 PreallocMode prealloc, BdrvRequestFlags flags, Error **errp);
-int blk_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
-int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
-                     int64_t pos, int size);
-int blk_load_vmstate(BlockBackend *blk, uint8_t *buf, int64_t pos, int size);
-int blk_probe_blocksizes(BlockBackend *blk, BlockSizes *bsz);
-int blk_probe_geometry(BlockBackend *blk, HDGeometry *geo);
-BlockAIOCB *blk_abort_aio_request(BlockBackend *blk,
-                                  BlockCompletionFunc *cb,
-                                  void *opaque, int ret);
-
-void blk_set_io_limits(BlockBackend *blk, ThrottleConfig *cfg);
-void blk_io_limits_disable(BlockBackend *blk);
-void blk_io_limits_enable(BlockBackend *blk, const char *group);
-void blk_io_limits_update_group(BlockBackend *blk, const char *group);
-void blk_set_force_allow_inactivate(BlockBackend *blk);
-
-void blk_register_buf(BlockBackend *blk, void *host, size_t size);
-void blk_unregister_buf(BlockBackend *blk, void *host);
-
-int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,
-                                   BlockBackend *blk_out, int64_t off_out,
-                                   int64_t bytes, BdrvRequestFlags read_flags,
-                                   BdrvRequestFlags write_flags);
-
-const BdrvChild *blk_root(BlockBackend *blk);
-
-int blk_make_empty(BlockBackend *blk, Error **errp);
+/* DO NOT ADD ANYTHING IN HERE. USE ONE OF THE HEADERS INCLUDED ABOVE */
 
 #endif
-- 
2.27.0




* [PATCH v4 05/25] block/block-backend.c: assertions for block-backend
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (3 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 04/25] include/sysemu/block-backend: split header into I/O and global state (GS) API Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-12 11:01   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 06/25] include/block/block_int: split header into I/O and global state API Emanuele Giuseppe Esposito
                   ` (22 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Every global state (GS) API function now asserts that
qemu_in_main_thread() returns true. If the assertion fails,
the function can run without the protection of the BQL,
so it must be moved to the I/O API instead.
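
As a rough illustration of the pattern this patch applies, here is a minimal standalone sketch. It does not use QEMU code: big_lock(), big_unlock(), in_main_thread(), Backend, backend_ref() and backend_request_alignment() are all hypothetical names, and the "lock" is only a flag, assumed here to stand in for the BQL state that qemu_in_main_thread() checks.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the BQL: a flag recording whether the "lock"
 * is held. This sketch is single-threaded; a real implementation would
 * have to track the state per thread.
 */
static bool big_lock_held;

static void big_lock(void)
{
    big_lock_held = true;
}

static void big_unlock(void)
{
    big_lock_held = false;
}

/* Hypothetical analogue of qemu_in_main_thread(). */
static bool in_main_thread(void)
{
    return big_lock_held;
}

/* Hypothetical object with both global state and I/O related fields. */
typedef struct Backend {
    int refcnt;
    int request_alignment;
} Backend;

/* GS-style function: checks its calling context before touching state. */
static void backend_ref(Backend *b)
{
    assert(in_main_thread());
    assert(b->refcnt > 0);
    b->refcnt++;
}

/* I/O-style function: carries no calling-context assertion. */
static int backend_request_alignment(const Backend *b)
{
    return b->request_alignment;
}
```

In this sketch, calling backend_ref() without a preceding big_lock() aborts on the first assertion, which is the signal that such a function is either being called from the wrong context or has been put in the wrong category.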

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/block-backend.c  | 90 +++++++++++++++++++++++++++++++++++++++++-
 softmmu/qdev-monitor.c |  2 +
 2 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 0afc03fd66..ed45576007 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -228,6 +228,7 @@ static void blk_root_activate(BdrvChild *child, Error **errp)
 
 void blk_set_force_allow_inactivate(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     blk->force_allow_inactivate = true;
 }
 
@@ -346,6 +347,8 @@ BlockBackend *blk_new(AioContext *ctx, uint64_t perm, uint64_t shared_perm)
 {
     BlockBackend *blk;
 
+    assert(qemu_in_main_thread());
+
     blk = g_new0(BlockBackend, 1);
     blk->refcnt = 1;
     blk->ctx = ctx;
@@ -383,6 +386,8 @@ BlockBackend *blk_new_with_bs(BlockDriverState *bs, uint64_t perm,
 {
     BlockBackend *blk = blk_new(bdrv_get_aio_context(bs), perm, shared_perm);
 
+    assert(qemu_in_main_thread());
+
     if (blk_insert_bs(blk, bs, errp) < 0) {
         blk_unref(blk);
         return NULL;
@@ -411,6 +416,8 @@ BlockBackend *blk_new_open(const char *filename, const char *reference,
     uint64_t perm = 0;
     uint64_t shared = BLK_PERM_ALL;
 
+    assert(qemu_in_main_thread());
+
     /*
      * blk_new_open() is mainly used in .bdrv_create implementations and the
      * tools where sharing isn't a major concern because the BDS stays private
@@ -488,6 +495,7 @@ static void drive_info_del(DriveInfo *dinfo)
 
 int blk_get_refcnt(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return blk ? blk->refcnt : 0;
 }
 
@@ -498,6 +506,7 @@ int blk_get_refcnt(BlockBackend *blk)
 void blk_ref(BlockBackend *blk)
 {
     assert(blk->refcnt > 0);
+    assert(qemu_in_main_thread());
     blk->refcnt++;
 }
 
@@ -508,6 +517,7 @@ void blk_ref(BlockBackend *blk)
  */
 void blk_unref(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     if (blk) {
         assert(blk->refcnt > 0);
         if (blk->refcnt > 1) {
@@ -528,6 +538,7 @@ void blk_unref(BlockBackend *blk)
  */
 BlockBackend *blk_all_next(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return blk ? QTAILQ_NEXT(blk, link)
                : QTAILQ_FIRST(&block_backends);
 }
@@ -536,6 +547,8 @@ void blk_remove_all_bs(void)
 {
     BlockBackend *blk = NULL;
 
+    assert(qemu_in_main_thread());
+
     while ((blk = blk_all_next(blk)) != NULL) {
         AioContext *ctx = blk_get_aio_context(blk);
 
@@ -559,6 +572,7 @@ void blk_remove_all_bs(void)
  */
 BlockBackend *blk_next(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return blk ? QTAILQ_NEXT(blk, monitor_link)
                : QTAILQ_FIRST(&monitor_block_backends);
 }
@@ -625,6 +639,7 @@ static void bdrv_next_reset(BdrvNextIterator *it)
 
 BlockDriverState *bdrv_first(BdrvNextIterator *it)
 {
+    assert(qemu_in_main_thread());
     bdrv_next_reset(it);
     return bdrv_next(it);
 }
@@ -662,6 +677,7 @@ bool monitor_add_blk(BlockBackend *blk, const char *name, Error **errp)
 {
     assert(!blk->name);
     assert(name && name[0]);
+    assert(qemu_in_main_thread());
 
     if (!id_wellformed(name)) {
         error_setg(errp, "Invalid device name");
@@ -689,6 +705,8 @@ bool monitor_add_blk(BlockBackend *blk, const char *name, Error **errp)
  */
 void monitor_remove_blk(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
+
     if (!blk->name) {
         return;
     }
@@ -704,6 +722,7 @@ void monitor_remove_blk(BlockBackend *blk)
  */
 const char *blk_name(const BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return blk->name ?: "";
 }
 
@@ -715,6 +734,7 @@ BlockBackend *blk_by_name(const char *name)
 {
     BlockBackend *blk = NULL;
 
+    assert(qemu_in_main_thread());
     assert(name);
     while ((blk = blk_next(blk)) != NULL) {
         if (!strcmp(name, blk->name)) {
@@ -749,6 +769,7 @@ static BlockBackend *bdrv_first_blk(BlockDriverState *bs)
  */
 bool bdrv_has_blk(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     return bdrv_first_blk(bs) != NULL;
 }
 
@@ -759,6 +780,7 @@ bool bdrv_is_root_node(BlockDriverState *bs)
 {
     BdrvChild *c;
 
+    assert(qemu_in_main_thread());
     QLIST_FOREACH(c, &bs->parents, next_parent) {
         if (c->klass != &child_root) {
             return false;
@@ -808,6 +830,7 @@ BlockBackend *blk_by_legacy_dinfo(DriveInfo *dinfo)
  */
 BlockBackendPublic *blk_get_public(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return &blk->public;
 }
 
@@ -816,6 +839,7 @@ BlockBackendPublic *blk_get_public(BlockBackend *blk)
  */
 BlockBackend *blk_by_public(BlockBackendPublic *public)
 {
+    assert(qemu_in_main_thread());
     return container_of(public, BlockBackend, public);
 }
 
@@ -828,6 +852,8 @@ void blk_remove_bs(BlockBackend *blk)
     BlockDriverState *bs;
     BdrvChild *root;
 
+    assert(qemu_in_main_thread());
+
     notifier_list_notify(&blk->remove_bs_notifiers, blk);
     if (tgm->throttle_state) {
         bs = blk_bs(blk);
@@ -855,6 +881,7 @@ void blk_remove_bs(BlockBackend *blk)
 int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp)
 {
     ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
+    assert(qemu_in_main_thread());
     bdrv_ref(bs);
     blk->root = bdrv_root_attach_child(bs, "root", &child_root,
                                        BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
@@ -889,6 +916,7 @@ int blk_set_perm(BlockBackend *blk, uint64_t perm, uint64_t shared_perm,
 {
     int ret;
 
+    assert(qemu_in_main_thread());
     if (blk->root && !blk->disable_perm) {
         ret = bdrv_child_try_set_perm(blk->root, perm, shared_perm, errp);
         if (ret < 0) {
@@ -904,6 +932,7 @@ int blk_set_perm(BlockBackend *blk, uint64_t perm, uint64_t shared_perm,
 
 void blk_get_perm(BlockBackend *blk, uint64_t *perm, uint64_t *shared_perm)
 {
+    assert(qemu_in_main_thread());
     *perm = blk->perm;
     *shared_perm = blk->shared_perm;
 }
@@ -914,6 +943,7 @@ void blk_get_perm(BlockBackend *blk, uint64_t *perm, uint64_t *shared_perm)
  */
 int blk_attach_dev(BlockBackend *blk, DeviceState *dev)
 {
+    assert(qemu_in_main_thread());
     if (blk->dev) {
         return -EBUSY;
     }
@@ -939,6 +969,7 @@ int blk_attach_dev(BlockBackend *blk, DeviceState *dev)
 void blk_detach_dev(BlockBackend *blk, DeviceState *dev)
 {
     assert(blk->dev == dev);
+    assert(qemu_in_main_thread());
     blk->dev = NULL;
     blk->dev_ops = NULL;
     blk->dev_opaque = NULL;
@@ -952,6 +983,7 @@ void blk_detach_dev(BlockBackend *blk, DeviceState *dev)
  */
 DeviceState *blk_get_attached_dev(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return blk->dev;
 }
 
@@ -960,6 +992,7 @@ DeviceState *blk_get_attached_dev(BlockBackend *blk)
 char *blk_get_attached_dev_id(BlockBackend *blk)
 {
     DeviceState *dev = blk->dev;
+    assert(qemu_in_main_thread());
 
     if (!dev) {
         return g_strdup("");
@@ -980,6 +1013,8 @@ BlockBackend *blk_by_dev(void *dev)
 {
     BlockBackend *blk = NULL;
 
+    assert(qemu_in_main_thread());
+
     assert(dev != NULL);
     while ((blk = blk_all_next(blk)) != NULL) {
         if (blk->dev == dev) {
@@ -997,6 +1032,7 @@ BlockBackend *blk_by_dev(void *dev)
 void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops,
                      void *opaque)
 {
+    assert(qemu_in_main_thread());
     blk->dev_ops = ops;
     blk->dev_opaque = opaque;
 
@@ -1018,6 +1054,7 @@ void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops,
  */
 void blk_dev_change_media_cb(BlockBackend *blk, bool load, Error **errp)
 {
+    assert(qemu_in_main_thread());
     if (blk->dev_ops && blk->dev_ops->change_media_cb) {
         bool tray_was_open, tray_is_open;
         Error *local_err = NULL;
@@ -1109,6 +1146,7 @@ static void blk_root_resize(BdrvChild *child)
 
 void blk_iostatus_enable(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     blk->iostatus_enabled = true;
     blk->iostatus = BLOCK_DEVICE_IO_STATUS_OK;
 }
@@ -1117,6 +1155,7 @@ void blk_iostatus_enable(BlockBackend *blk)
  * enables it _and_ the VM is configured to stop on errors */
 bool blk_iostatus_is_enabled(const BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return (blk->iostatus_enabled &&
            (blk->on_write_error == BLOCKDEV_ON_ERROR_ENOSPC ||
             blk->on_write_error == BLOCKDEV_ON_ERROR_STOP   ||
@@ -1125,16 +1164,19 @@ bool blk_iostatus_is_enabled(const BlockBackend *blk)
 
 BlockDeviceIoStatus blk_iostatus(const BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return blk->iostatus;
 }
 
 void blk_iostatus_disable(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     blk->iostatus_enabled = false;
 }
 
 void blk_iostatus_reset(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     if (blk_iostatus_is_enabled(blk)) {
         blk->iostatus = BLOCK_DEVICE_IO_STATUS_OK;
     }
@@ -1142,6 +1184,7 @@ void blk_iostatus_reset(BlockBackend *blk)
 
 void blk_iostatus_set_err(BlockBackend *blk, int error)
 {
+    assert(qemu_in_main_thread());
     assert(blk_iostatus_is_enabled(blk));
     if (blk->iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
         blk->iostatus = error == ENOSPC ? BLOCK_DEVICE_IO_STATUS_NOSPACE :
@@ -1341,6 +1384,7 @@ int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
 
 int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
 {
+    assert(qemu_in_main_thread());
     return bdrv_make_zero(blk->root, flags);
 }
 
@@ -1369,6 +1413,7 @@ BlockAIOCB *blk_abort_aio_request(BlockBackend *blk,
                                   void *opaque, int ret)
 {
     struct BlockBackendAIOCB *acb;
+    assert(qemu_in_main_thread());
 
     blk_inc_in_flight(blk);
     acb = blk_aio_get(&block_backend_aiocb_info, blk, cb, opaque);
@@ -1523,6 +1568,7 @@ void blk_get_geometry(BlockBackend *blk, uint64_t *nb_sectors_ptr)
 
 int64_t blk_nb_sectors(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     if (!blk_is_available(blk)) {
         return -ENOMEDIUM;
     }
@@ -1550,6 +1596,7 @@ BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
 
 void blk_aio_cancel(BlockAIOCB *acb)
 {
+    assert(qemu_in_main_thread());
     bdrv_aio_cancel(acb);
 }
 
@@ -1707,6 +1754,8 @@ void blk_drain(BlockBackend *blk)
 {
     BlockDriverState *bs = blk_bs(blk);
 
+    assert(qemu_in_main_thread());
+
     if (bs) {
         bdrv_drained_begin(bs);
     }
@@ -1724,6 +1773,8 @@ void blk_drain_all(void)
 {
     BlockBackend *blk = NULL;
 
+    assert(qemu_in_main_thread());
+
     bdrv_drain_all_begin();
 
     while ((blk = blk_all_next(blk)) != NULL) {
@@ -1743,6 +1794,7 @@ void blk_drain_all(void)
 void blk_set_on_error(BlockBackend *blk, BlockdevOnError on_read_error,
                       BlockdevOnError on_write_error)
 {
+    assert(qemu_in_main_thread());
     blk->on_read_error = on_read_error;
     blk->on_write_error = on_write_error;
 }
@@ -1826,6 +1878,7 @@ void blk_error_action(BlockBackend *blk, BlockErrorAction action,
 bool blk_supports_write_perm(BlockBackend *blk)
 {
     BlockDriverState *bs = blk_bs(blk);
+    assert(qemu_in_main_thread());
 
     if (bs) {
         return !bdrv_is_read_only(bs);
@@ -1846,6 +1899,7 @@ bool blk_is_writable(BlockBackend *blk)
 bool blk_is_sg(BlockBackend *blk)
 {
     BlockDriverState *bs = blk_bs(blk);
+    assert(qemu_in_main_thread());
 
     if (!bs) {
         return false;
@@ -1856,17 +1910,20 @@ bool blk_is_sg(BlockBackend *blk)
 
 bool blk_enable_write_cache(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return blk->enable_write_cache;
 }
 
 void blk_set_enable_write_cache(BlockBackend *blk, bool wce)
 {
+    assert(qemu_in_main_thread());
     blk->enable_write_cache = wce;
 }
 
 void blk_invalidate_cache(BlockBackend *blk, Error **errp)
 {
     BlockDriverState *bs = blk_bs(blk);
+    assert(qemu_in_main_thread());
 
     if (!bs) {
         error_setg(errp, "Device '%s' has no medium", blk->name);
@@ -1879,7 +1936,6 @@ void blk_invalidate_cache(BlockBackend *blk, Error **errp)
 bool blk_is_inserted(BlockBackend *blk)
 {
     BlockDriverState *bs = blk_bs(blk);
-
     return bs && bdrv_is_inserted(bs);
 }
 
@@ -1891,6 +1947,7 @@ bool blk_is_available(BlockBackend *blk)
 void blk_lock_medium(BlockBackend *blk, bool locked)
 {
     BlockDriverState *bs = blk_bs(blk);
+    assert(qemu_in_main_thread());
 
     if (bs) {
         bdrv_lock_medium(bs, locked);
@@ -1900,6 +1957,8 @@ void blk_lock_medium(BlockBackend *blk, bool locked)
 void blk_eject(BlockBackend *blk, bool eject_flag)
 {
     BlockDriverState *bs = blk_bs(blk);
     char *id;
 
+    assert(qemu_in_main_thread());
+
     if (bs) {
@@ -1917,6 +1976,7 @@ void blk_eject(BlockBackend *blk, bool eject_flag)
 int blk_get_flags(BlockBackend *blk)
 {
     BlockDriverState *bs = blk_bs(blk);
+    assert(qemu_in_main_thread());
 
     if (bs) {
         return bdrv_get_flags(bs);
@@ -1970,6 +2030,7 @@ int blk_get_max_iov(BlockBackend *blk)
 
 void blk_set_guest_block_size(BlockBackend *blk, int align)
 {
+    assert(qemu_in_main_thread());
     blk->guest_block_size = align;
 }
 
@@ -1986,6 +2047,7 @@ void *blk_blockalign(BlockBackend *blk, size_t size)
 bool blk_op_is_blocked(BlockBackend *blk, BlockOpType op, Error **errp)
 {
     BlockDriverState *bs = blk_bs(blk);
+    assert(qemu_in_main_thread());
 
     if (!bs) {
         return false;
@@ -1997,6 +2059,7 @@ bool blk_op_is_blocked(BlockBackend *blk, BlockOpType op, Error **errp)
 void blk_op_unblock(BlockBackend *blk, BlockOpType op, Error *reason)
 {
     BlockDriverState *bs = blk_bs(blk);
+    assert(qemu_in_main_thread());
 
     if (bs) {
         bdrv_op_unblock(bs, op, reason);
@@ -2006,6 +2069,7 @@ void blk_op_unblock(BlockBackend *blk, BlockOpType op, Error *reason)
 void blk_op_block_all(BlockBackend *blk, Error *reason)
 {
     BlockDriverState *bs = blk_bs(blk);
+    assert(qemu_in_main_thread());
 
     if (bs) {
         bdrv_op_block_all(bs, reason);
@@ -2015,6 +2079,7 @@ void blk_op_block_all(BlockBackend *blk, Error *reason)
 void blk_op_unblock_all(BlockBackend *blk, Error *reason)
 {
     BlockDriverState *bs = blk_bs(blk);
+    assert(qemu_in_main_thread());
 
     if (bs) {
         bdrv_op_unblock_all(bs, reason);
@@ -2069,6 +2134,7 @@ static int blk_do_set_aio_context(BlockBackend *blk, AioContext *new_context,
 int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
                         Error **errp)
 {
+    assert(qemu_in_main_thread());
     return blk_do_set_aio_context(blk, new_context, true, errp);
 }
 
@@ -2105,6 +2171,7 @@ void blk_add_aio_context_notifier(BlockBackend *blk,
 {
     BlockBackendAioNotifier *notifier;
     BlockDriverState *bs = blk_bs(blk);
+    assert(qemu_in_main_thread());
 
     notifier = g_new(BlockBackendAioNotifier, 1);
     notifier->attached_aio_context = attached_aio_context;
@@ -2127,6 +2194,8 @@ void blk_remove_aio_context_notifier(BlockBackend *blk,
     BlockBackendAioNotifier *notifier;
     BlockDriverState *bs = blk_bs(blk);
 
+    assert(qemu_in_main_thread());
+
     if (bs) {
         bdrv_remove_aio_context_notifier(bs, attached_aio_context,
                                          detach_aio_context, opaque);
@@ -2147,11 +2216,13 @@ void blk_remove_aio_context_notifier(BlockBackend *blk,
 
 void blk_add_remove_bs_notifier(BlockBackend *blk, Notifier *notify)
 {
+    assert(qemu_in_main_thread());
     notifier_list_add(&blk->remove_bs_notifiers, notify);
 }
 
 void blk_add_insert_bs_notifier(BlockBackend *blk, Notifier *notify)
 {
+    assert(qemu_in_main_thread());
     notifier_list_add(&blk->insert_bs_notifiers, notify);
 }
 
@@ -2214,6 +2285,7 @@ int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
                      int64_t pos, int size)
 {
     int ret;
+    assert(qemu_in_main_thread());
 
     if (!blk_is_available(blk)) {
         return -ENOMEDIUM;
@@ -2233,6 +2305,7 @@ int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
 
 int blk_load_vmstate(BlockBackend *blk, uint8_t *buf, int64_t pos, int size)
 {
+    assert(qemu_in_main_thread());
     if (!blk_is_available(blk)) {
         return -ENOMEDIUM;
     }
@@ -2242,6 +2315,7 @@ int blk_load_vmstate(BlockBackend *blk, uint8_t *buf, int64_t pos, int size)
 
 int blk_probe_blocksizes(BlockBackend *blk, BlockSizes *bsz)
 {
+    assert(qemu_in_main_thread());
     if (!blk_is_available(blk)) {
         return -ENOMEDIUM;
     }
@@ -2251,6 +2325,7 @@ int blk_probe_blocksizes(BlockBackend *blk, BlockSizes *bsz)
 
 int blk_probe_geometry(BlockBackend *blk, HDGeometry *geo)
 {
+    assert(qemu_in_main_thread());
     if (!blk_is_available(blk)) {
         return -ENOMEDIUM;
     }
@@ -2264,6 +2339,7 @@ int blk_probe_geometry(BlockBackend *blk, HDGeometry *geo)
  */
 void blk_update_root_state(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     assert(blk->root);
 
     blk->root_state.open_flags    = blk->root->bs->open_flags;
@@ -2276,6 +2352,7 @@ void blk_update_root_state(BlockBackend *blk)
  */
 bool blk_get_detect_zeroes_from_root_state(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return blk->root_state.detect_zeroes;
 }
 
@@ -2285,17 +2362,20 @@ bool blk_get_detect_zeroes_from_root_state(BlockBackend *blk)
  */
 int blk_get_open_flags_from_root_state(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return blk->root_state.open_flags;
 }
 
 BlockBackendRootState *blk_get_root_state(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return &blk->root_state;
 }
 
 int blk_commit_all(void)
 {
     BlockBackend *blk = NULL;
+    assert(qemu_in_main_thread());
 
     while ((blk = blk_all_next(blk)) != NULL) {
         AioContext *aio_context = blk_get_aio_context(blk);
@@ -2320,6 +2400,7 @@ int blk_commit_all(void)
 /* throttling disk I/O limits */
 void blk_set_io_limits(BlockBackend *blk, ThrottleConfig *cfg)
 {
+    assert(qemu_in_main_thread());
     throttle_group_config(&blk->public.throttle_group_member, cfg);
 }
 
@@ -2328,6 +2409,7 @@ void blk_io_limits_disable(BlockBackend *blk)
     BlockDriverState *bs = blk_bs(blk);
     ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
     assert(tgm->throttle_state);
+    assert(qemu_in_main_thread());
     if (bs) {
         bdrv_drained_begin(bs);
     }
@@ -2341,12 +2423,14 @@ void blk_io_limits_disable(BlockBackend *blk)
 void blk_io_limits_enable(BlockBackend *blk, const char *group)
 {
     assert(!blk->public.throttle_group_member.throttle_state);
+    assert(qemu_in_main_thread());
     throttle_group_register_tgm(&blk->public.throttle_group_member,
                                 group, blk_get_aio_context(blk));
 }
 
 void blk_io_limits_update_group(BlockBackend *blk, const char *group)
 {
+    assert(qemu_in_main_thread());
     /* this BB is not part of any group */
     if (!blk->public.throttle_group_member.throttle_state) {
         return;
@@ -2414,11 +2498,13 @@ static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter)
 
 void blk_register_buf(BlockBackend *blk, void *host, size_t size)
 {
+    assert(qemu_in_main_thread());
     bdrv_register_buf(blk_bs(blk), host, size);
 }
 
 void blk_unregister_buf(BlockBackend *blk, void *host)
 {
+    assert(qemu_in_main_thread());
     bdrv_unregister_buf(blk_bs(blk), host);
 }
 
@@ -2443,11 +2529,13 @@ int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,
 
 const BdrvChild *blk_root(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return blk->root;
 }
 
 int blk_make_empty(BlockBackend *blk, Error **errp)
 {
+    assert(qemu_in_main_thread());
     if (!blk_is_available(blk)) {
         error_setg(errp, "No medium inserted");
         return -ENOMEDIUM;
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index 4851de51a5..71cb6a549b 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -963,6 +963,8 @@ BlockBackend *blk_by_qdev_id(const char *id, Error **errp)
     DeviceState *dev;
     BlockBackend *blk;
 
+    assert(qemu_in_main_thread());
+
     dev = find_device_state(id, errp);
     if (dev == NULL) {
         return NULL;
-- 
2.27.0




* [PATCH v4 06/25] include/block/block_int: split header into I/O and global state API
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (4 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 05/25] block/block-backend.c: assertions for block-backend Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-12 12:17   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 07/25] assertions for block_int " Emanuele Giuseppe Esposito
                   ` (21 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Similarly to the block-backend.h split, split block_int.h
into block_int-io.h and block_int-global-state.h.

block_int-common.h contains the structures shared between
the two headers, and the functions that can't be categorized as
I/O or global state.
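
The layout described above can be sketched in a single file as follows. This is only an illustration under assumed names: Widget, widget_ref(), widget_in_flight() and the widget-*.h headers are hypothetical and do not exist in QEMU; the comment banners mark where each header's contents would live.

```c
#include <assert.h>

/* ---- widget-common.h (hypothetical): types shared by both APIs ---- */
typedef struct Widget {
    int refcnt;     /* touched by the global state API */
    int in_flight;  /* touched by the I/O API */
} Widget;

/* ---- widget-global-state.h (hypothetical): GS API prototypes ---- */
void widget_ref(Widget *w);

/* ---- widget-io.h (hypothetical): I/O API prototypes ---- */
int widget_in_flight(const Widget *w);

/*
 * ---- widget.h (hypothetical): legacy wrapper ----
 * It would contain nothing but the includes of the sub-headers, so
 * existing users keep compiling without changing their include lines.
 */

/* ---- widget.c (hypothetical): implementations ---- */
void widget_ref(Widget *w)
{
    w->refcnt++;
}

int widget_in_flight(const Widget *w)
{
    return w->in_flight;
}
```

Keeping the type definition in the common part lets each sub-header be included on its own, while the wrapper header preserves the old include path.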

Assertions are added in the next patch.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 blockdev.c                             |    5 +
 include/block/block_int-common.h       | 1164 +++++++++++++++++++
 include/block/block_int-global-state.h |  319 +++++
 include/block/block_int-io.h           |  163 +++
 include/block/block_int.h              | 1478 +-----------------------
 5 files changed, 1654 insertions(+), 1475 deletions(-)
 create mode 100644 include/block/block_int-common.h
 create mode 100644 include/block/block_int-global-state.h
 create mode 100644 include/block/block_int-io.h

diff --git a/blockdev.c b/blockdev.c
index ae322ed10e..ddba382abd 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -63,6 +63,7 @@
 #include "qemu/main-loop.h"
 #include "qemu/throttle-options.h"
 
+/* Protected by BQL */
 QTAILQ_HEAD(, BlockDriverState) monitor_bdrv_states =
     QTAILQ_HEAD_INITIALIZER(monitor_bdrv_states);
 
@@ -1207,6 +1208,8 @@ typedef struct BlkActionState BlkActionState;
  *
  * Only prepare() may fail. In a single transaction, only one of commit() or
  * abort() will be called. clean() will always be called if it is present.
+ *
+ * Always run under BQL.
  */
 typedef struct BlkActionOps {
     size_t instance_size;
@@ -2316,6 +2319,8 @@ static TransactionProperties *get_transaction_properties(
 /*
  * 'Atomic' group operations.  The operations are performed as a set, and if
  * any fail then we roll back all operations in the group.
+ *
+ * Always run under BQL.
  */
 void qmp_transaction(TransactionActionList *dev_list,
                      bool has_props,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
new file mode 100644
index 0000000000..79a3d801d2
--- /dev/null
+++ b/include/block/block_int-common.h
@@ -0,0 +1,1164 @@
+/*
+ * QEMU System Emulator block driver
+ *
+ * Copyright (c) 2003 Fabrice Bellard
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#ifndef BLOCK_INT_COMMON_H
+#define BLOCK_INT_COMMON_H
+
+#include "block/accounting.h"
+#include "block/block.h"
+#include "block/aio-wait.h"
+#include "qemu/queue.h"
+#include "qemu/coroutine.h"
+#include "qemu/stats64.h"
+#include "qemu/timer.h"
+#include "qemu/hbitmap.h"
+#include "block/snapshot.h"
+#include "qemu/throttle.h"
+#include "qemu/rcu.h"
+
+#define BLOCK_FLAG_LAZY_REFCOUNTS   8
+
+#define BLOCK_OPT_SIZE              "size"
+#define BLOCK_OPT_ENCRYPT           "encryption"
+#define BLOCK_OPT_ENCRYPT_FORMAT    "encrypt.format"
+#define BLOCK_OPT_COMPAT6           "compat6"
+#define BLOCK_OPT_HWVERSION         "hwversion"
+#define BLOCK_OPT_BACKING_FILE      "backing_file"
+#define BLOCK_OPT_BACKING_FMT       "backing_fmt"
+#define BLOCK_OPT_CLUSTER_SIZE      "cluster_size"
+#define BLOCK_OPT_TABLE_SIZE        "table_size"
+#define BLOCK_OPT_PREALLOC          "preallocation"
+#define BLOCK_OPT_SUBFMT            "subformat"
+#define BLOCK_OPT_COMPAT_LEVEL      "compat"
+#define BLOCK_OPT_LAZY_REFCOUNTS    "lazy_refcounts"
+#define BLOCK_OPT_ADAPTER_TYPE      "adapter_type"
+#define BLOCK_OPT_REDUNDANCY        "redundancy"
+#define BLOCK_OPT_NOCOW             "nocow"
+#define BLOCK_OPT_EXTENT_SIZE_HINT  "extent_size_hint"
+#define BLOCK_OPT_OBJECT_SIZE       "object_size"
+#define BLOCK_OPT_REFCOUNT_BITS     "refcount_bits"
+#define BLOCK_OPT_DATA_FILE         "data_file"
+#define BLOCK_OPT_DATA_FILE_RAW     "data_file_raw"
+#define BLOCK_OPT_COMPRESSION_TYPE  "compression_type"
+#define BLOCK_OPT_EXTL2             "extended_l2"
+
+#define BLOCK_PROBE_BUF_SIZE        512
+
+enum BdrvTrackedRequestType {
+    BDRV_TRACKED_READ,
+    BDRV_TRACKED_WRITE,
+    BDRV_TRACKED_DISCARD,
+    BDRV_TRACKED_TRUNCATE,
+};
+
+/*
+ * It is not ideal that the BdrvTrackedRequest structure is public,
+ * as block/io.c is very careful about incoming offset/bytes being
+ * correct. Be sure to assert bdrv_check_request() succeeded after any
+ * modification of a BdrvTrackedRequest object outside of block/io.c
+ */
+typedef struct BdrvTrackedRequest {
+    BlockDriverState *bs;
+    int64_t offset;
+    int64_t bytes;
+    enum BdrvTrackedRequestType type;
+
+    bool serialising;
+    int64_t overlap_offset;
+    int64_t overlap_bytes;
+
+    QLIST_ENTRY(BdrvTrackedRequest) list;
+    Coroutine *co; /* owner, used for deadlock detection */
+    CoQueue wait_queue; /* coroutines blocked on this request */
+
+    struct BdrvTrackedRequest *waiting_for;
+} BdrvTrackedRequest;
+
+
+struct BlockDriver {
+    const char *format_name;
+    int instance_size;
+
+    /*
+     * Set to true if the BlockDriver is a block filter. Block filters pass
+     * certain callbacks that refer to data (see block.c) to their bs->file
+     * or bs->backing (whichever one exists) if the driver doesn't implement
+     * them. Drivers that do not wish to forward must implement them and return
+     * -ENOTSUP.
+     * Note that filters are not allowed to modify data.
+     *
+     * Filters generally cannot have more than a single filtered child,
+     * because the data they present must at all times be the same as
+     * that on their filtered child.  That would be impossible to
+     * achieve for multiple filtered children.
+     * (And this filtered child must then be bs->file or bs->backing.)
+     */
+    bool is_filter;
+    /*
+     * Set to true if the BlockDriver is a format driver.  Format nodes
+     * generally do not expect their children to be other format nodes
+     * (except for backing files), and so format probing is disabled
+     * on those children.
+     */
+    bool is_format;
+    /*
+     * Return true if @to_replace can be replaced by a BDS with the
+     * same data as @bs without it affecting @bs's behavior (that is,
+     * without it being visible to @bs's parents).
+     */
+    bool (*bdrv_recurse_can_replace)(BlockDriverState *bs,
+                                     BlockDriverState *to_replace);
+
+    int (*bdrv_probe)(const uint8_t *buf, int buf_size, const char *filename);
+    int (*bdrv_probe_device)(const char *filename);
+
+    /*
+     * Any driver implementing this callback is expected to be able to handle
+     * NULL file names in its .bdrv_open() implementation.
+     */
+    void (*bdrv_parse_filename)(const char *filename, QDict *options,
+                                Error **errp);
+    /*
+     * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
+     * this field set to true, except ones that are defined only by their
+     * child's bs.
+     * An example of the last type will be the quorum block driver.
+     */
+    bool bdrv_needs_filename;
+
+    /*
+     * Set if a driver can support backing files. This also implies the
+     * following semantics:
+     *
+     *  - Return status 0 of .bdrv_co_block_status means that corresponding
+     *    blocks are not allocated in this layer of backing-chain
+     *  - For such (unallocated) blocks, read will:
+     *    - fill buffer with zeros if there is no backing file
+     *    - read from the backing file otherwise, where the block layer
+     *      takes care of reading zeros beyond EOF if backing file is short
+     */
+    bool supports_backing;
+
+    /* For handling image reopen for split or non-split files */
+    int (*bdrv_reopen_prepare)(BDRVReopenState *reopen_state,
+                               BlockReopenQueue *queue, Error **errp);
+    void (*bdrv_reopen_commit)(BDRVReopenState *reopen_state);
+    void (*bdrv_reopen_commit_post)(BDRVReopenState *reopen_state);
+    void (*bdrv_reopen_abort)(BDRVReopenState *reopen_state);
+    void (*bdrv_join_options)(QDict *options, QDict *old_options);
+
+    int (*bdrv_open)(BlockDriverState *bs, QDict *options, int flags,
+                     Error **errp);
+
+    /* Protocol drivers should implement this instead of bdrv_open */
+    int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags,
+                          Error **errp);
+    void (*bdrv_close)(BlockDriverState *bs);
+
+
+    int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
+                                       Error **errp);
+    int coroutine_fn (*bdrv_co_create_opts)(BlockDriver *drv,
+                                            const char *filename,
+                                            QemuOpts *opts,
+                                            Error **errp);
+
+    int coroutine_fn (*bdrv_co_amend)(BlockDriverState *bs,
+                                      BlockdevAmendOptions *opts,
+                                      bool force,
+                                      Error **errp);
+
+    int (*bdrv_amend_options)(BlockDriverState *bs,
+                              QemuOpts *opts,
+                              BlockDriverAmendStatusCB *status_cb,
+                              void *cb_opaque,
+                              bool force,
+                              Error **errp);
+
+    int (*bdrv_make_empty)(BlockDriverState *bs);
+
+    /*
+     * Refreshes the bs->exact_filename field. If that is impossible,
+     * bs->exact_filename has to be left empty.
+     */
+    void (*bdrv_refresh_filename)(BlockDriverState *bs);
+
+    /*
+     * Gathers the open options for all children into @target.
+     * A simple format driver (without backing file support) might
+     * implement this function like this:
+     *
+     *     QINCREF(bs->file->bs->full_open_options);
+     *     qdict_put(target, "file", bs->file->bs->full_open_options);
+     *
+     * If not specified, the generic implementation will simply put
+     * all children's options under their respective name.
+     *
+     * @backing_overridden is true when bs->backing seems not to be
+     * the child that would result from opening bs->backing_file.
+     * Therefore, if it is true, the backing child's options should be
+     * gathered; otherwise, there is no need since the backing child
+     * is the one implied by the image header.
+     *
+     * Note that ideally this function would not be needed.  Every
+     * block driver which implements it is probably doing something
+     * shady regarding its runtime option structure.
+     */
+    void (*bdrv_gather_child_options)(BlockDriverState *bs, QDict *target,
+                                      bool backing_overridden);
+
+    /*
+     * Returns an allocated string which is the directory name of this BDS: It
+     * will be used to make relative filenames absolute by prepending this
+     * function's return value to them.
+     */
+    char *(*bdrv_dirname)(BlockDriverState *bs, Error **errp);
+
+    /* aio */
+    BlockAIOCB *(*bdrv_aio_preadv)(BlockDriverState *bs,
+        int64_t offset, int64_t bytes, QEMUIOVector *qiov,
+        BdrvRequestFlags flags, BlockCompletionFunc *cb, void *opaque);
+    BlockAIOCB *(*bdrv_aio_pwritev)(BlockDriverState *bs,
+        int64_t offset, int64_t bytes, QEMUIOVector *qiov,
+        BdrvRequestFlags flags, BlockCompletionFunc *cb, void *opaque);
+    BlockAIOCB *(*bdrv_aio_flush)(BlockDriverState *bs,
+        BlockCompletionFunc *cb, void *opaque);
+    BlockAIOCB *(*bdrv_aio_pdiscard)(BlockDriverState *bs,
+        int64_t offset, int bytes,
+        BlockCompletionFunc *cb, void *opaque);
+
+    int coroutine_fn (*bdrv_co_readv)(BlockDriverState *bs,
+        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
+
+    /**
+     * @offset: position in bytes to read at
+     * @bytes: number of bytes to read
+     * @qiov: the buffers to fill with read data
+     * @flags: currently unused, always 0
+     *
+     * @offset and @bytes will be a multiple of 'request_alignment',
+     * but the length of individual @qiov elements does not have to
+     * be a multiple.
+     *
+     * @bytes will always equal the total size of @qiov, and will be
+     * no larger than 'max_transfer'.
+     *
+     * The buffer in @qiov may point directly to guest memory.
+     */
+    int coroutine_fn (*bdrv_co_preadv)(BlockDriverState *bs,
+        int64_t offset, int64_t bytes, QEMUIOVector *qiov,
+        BdrvRequestFlags flags);
+
+    int coroutine_fn (*bdrv_co_preadv_part)(BlockDriverState *bs,
+        int64_t offset, int64_t bytes,
+        QEMUIOVector *qiov, size_t qiov_offset,
+        BdrvRequestFlags flags);
+
+    int coroutine_fn (*bdrv_co_writev)(BlockDriverState *bs,
+        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+        int flags);
+    /**
+     * @offset: position in bytes to write at
+     * @bytes: number of bytes to write
+     * @qiov: the buffers containing data to write
+     * @flags: zero or more bits allowed by 'supported_write_flags'
+     *
+     * @offset and @bytes will be a multiple of 'request_alignment',
+     * but the length of individual @qiov elements does not have to
+     * be a multiple.
+     *
+     * @bytes will always equal the total size of @qiov, and will be
+     * no larger than 'max_transfer'.
+     *
+     * The buffer in @qiov may point directly to guest memory.
+     */
+    int coroutine_fn (*bdrv_co_pwritev)(BlockDriverState *bs,
+        int64_t offset, int64_t bytes, QEMUIOVector *qiov,
+        BdrvRequestFlags flags);
+    int coroutine_fn (*bdrv_co_pwritev_part)(BlockDriverState *bs,
+        int64_t offset, int64_t bytes, QEMUIOVector *qiov, size_t qiov_offset,
+        BdrvRequestFlags flags);
+
+    /*
+     * Efficiently zero a region of the disk image.  Typically an image format
+     * would use a compact metadata representation to implement this.  This
+     * function pointer may be NULL or return -ENOTSUP and .bdrv_co_writev()
+     * will be called instead.
+     */
+    int coroutine_fn (*bdrv_co_pwrite_zeroes)(BlockDriverState *bs,
+        int64_t offset, int64_t bytes, BdrvRequestFlags flags);
+    int coroutine_fn (*bdrv_co_pdiscard)(BlockDriverState *bs,
+        int64_t offset, int64_t bytes);
+
+    /*
+     * Map [offset, offset + bytes) range onto a child of @bs to copy from,
+     * and invoke bdrv_co_copy_range_from(child, ...), or invoke
+     * bdrv_co_copy_range_to() if @bs is the leaf child to copy data from.
+     *
+     * See the comment of bdrv_co_copy_range for the parameter and return value
+     * semantics.
+     */
+    int coroutine_fn (*bdrv_co_copy_range_from)(BlockDriverState *bs,
+                                                BdrvChild *src,
+                                                int64_t offset,
+                                                BdrvChild *dst,
+                                                int64_t dst_offset,
+                                                int64_t bytes,
+                                                BdrvRequestFlags read_flags,
+                                                BdrvRequestFlags write_flags);
+
+    /*
+     * Map [offset, offset + bytes) range onto a child of @bs to copy data to,
+     * and invoke bdrv_co_copy_range_to(child, src, ...), or perform the copy
+     * operation if @bs is the leaf and @src has the same BlockDriver.  Return
+     * -ENOTSUP if @bs is the leaf but @src has a different BlockDriver.
+     *
+     * See the comment of bdrv_co_copy_range for the parameter and return value
+     * semantics.
+     */
+    int coroutine_fn (*bdrv_co_copy_range_to)(BlockDriverState *bs,
+                                              BdrvChild *src,
+                                              int64_t src_offset,
+                                              BdrvChild *dst,
+                                              int64_t dst_offset,
+                                              int64_t bytes,
+                                              BdrvRequestFlags read_flags,
+                                              BdrvRequestFlags write_flags);
+
+    /*
+     * Building block for bdrv_block_status[_above] and
+     * bdrv_is_allocated[_above].  The driver should answer only
+     * according to the current layer, and should only need to set
+     * BDRV_BLOCK_DATA, BDRV_BLOCK_ZERO, BDRV_BLOCK_OFFSET_VALID,
+     * and/or BDRV_BLOCK_RAW; if the current layer defers to a backing
+     * layer, the result should be 0 (and not BDRV_BLOCK_ZERO).  See
+     * block.h for the overall meaning of the bits.  As a hint, the
+     * flag want_zero is true if the caller cares more about precise
+     * mappings (favor accurate _OFFSET_VALID/_ZERO) or false for
+     * overall allocation (favor larger *pnum, perhaps by reporting
+     * _DATA instead of _ZERO).  The block layer guarantees input
+     * clamped to bdrv_getlength() and aligned to request_alignment,
+     * as well as non-NULL pnum, map, and file; in turn, the driver
+     * must return an error or set pnum to an aligned non-zero value.
+     *
+     * Note that @bytes is just a hint on how big of a region the
+     * caller wants to inspect.  It is not a limit on *pnum.
+     * Implementations are free to return larger values of *pnum if
+     * doing so does not incur a performance penalty.
+     *
+     * block/io.c's bdrv_co_block_status() will utilize an unclamped
+     * *pnum value for the block-status cache on protocol nodes, prior
+     * to clamping *pnum for return to its caller.
+     */
+    int coroutine_fn (*bdrv_co_block_status)(BlockDriverState *bs,
+        bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
+        int64_t *map, BlockDriverState **file);
+
+    /*
+     * This informs the driver that we are no longer interested in the result
+     * of in-flight requests, so don't waste the time if possible.
+     *
+     * One example usage is to avoid waiting for an nbd target node reconnect
+     * timeout during job-cancel with force=true.
+     */
+    void (*bdrv_cancel_in_flight)(BlockDriverState *bs);
+
+    /*
+     * Invalidate any cached meta-data.
+     */
+    void coroutine_fn (*bdrv_co_invalidate_cache)(BlockDriverState *bs,
+                                                  Error **errp);
+    int (*bdrv_inactivate)(BlockDriverState *bs);
+
+    /*
+     * Flushes all data for all layers by calling bdrv_co_flush for underlying
+     * layers, if needed. This function is needed for deterministic
+     * synchronization of the flush finishing callback.
+     */
+    int coroutine_fn (*bdrv_co_flush)(BlockDriverState *bs);
+
+    /* Delete a created file. */
+    int coroutine_fn (*bdrv_co_delete_file)(BlockDriverState *bs,
+                                            Error **errp);
+
+    /*
+     * Flushes all data that was already written to the OS all the way down to
+     * the disk (for example file-posix.c calls fsync()).
+     */
+    int coroutine_fn (*bdrv_co_flush_to_disk)(BlockDriverState *bs);
+
+    /*
+     * Flushes all internal caches to the OS. The data may still sit in a
+     * writeback cache of the host OS, but it will survive a crash of the qemu
+     * process.
+     */
+    int coroutine_fn (*bdrv_co_flush_to_os)(BlockDriverState *bs);
+
+    /*
+     * Drivers setting this field must be able to work with just a plain
+     * filename with '<protocol_name>:' as a prefix, and no other options.
+     * Options may be extracted from the filename by implementing
+     * bdrv_parse_filename.
+     */
+    const char *protocol_name;
+
+    /*
+     * Truncate @bs to @offset bytes using the given @prealloc mode
+     * when growing.  Modes other than PREALLOC_MODE_OFF should be
+     * rejected when shrinking @bs.
+     *
+     * If @exact is true, @bs must be resized to exactly @offset.
+     * Otherwise, it is sufficient for @bs (if it is a host block
+     * device and thus there is no way to resize it) to be at least
+     * @offset bytes in length.
+     *
+     * If @exact is true and this function fails but would succeed
+     * with @exact = false, it should return -ENOTSUP.
+     */
+    int coroutine_fn (*bdrv_co_truncate)(BlockDriverState *bs, int64_t offset,
+                                         bool exact, PreallocMode prealloc,
+                                         BdrvRequestFlags flags, Error **errp);
+    int64_t (*bdrv_getlength)(BlockDriverState *bs);
+    bool has_variable_length;
+    int64_t (*bdrv_get_allocated_file_size)(BlockDriverState *bs);
+    BlockMeasureInfo *(*bdrv_measure)(QemuOpts *opts, BlockDriverState *in_bs,
+                                      Error **errp);
+
+    int coroutine_fn (*bdrv_co_pwritev_compressed)(BlockDriverState *bs,
+        int64_t offset, int64_t bytes, QEMUIOVector *qiov);
+    int coroutine_fn (*bdrv_co_pwritev_compressed_part)(BlockDriverState *bs,
+        int64_t offset, int64_t bytes, QEMUIOVector *qiov,
+        size_t qiov_offset);
+
+    int (*bdrv_snapshot_create)(BlockDriverState *bs,
+                                QEMUSnapshotInfo *sn_info);
+    int (*bdrv_snapshot_goto)(BlockDriverState *bs,
+                              const char *snapshot_id);
+    int (*bdrv_snapshot_delete)(BlockDriverState *bs,
+                                const char *snapshot_id,
+                                const char *name,
+                                Error **errp);
+    int (*bdrv_snapshot_list)(BlockDriverState *bs,
+                              QEMUSnapshotInfo **psn_info);
+    int (*bdrv_snapshot_load_tmp)(BlockDriverState *bs,
+                                  const char *snapshot_id,
+                                  const char *name,
+                                  Error **errp);
+    int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
+
+    ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
+                                                 Error **errp);
+    BlockStatsSpecific *(*bdrv_get_specific_stats)(BlockDriverState *bs);
+
+    int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs,
+                                          QEMUIOVector *qiov,
+                                          int64_t pos);
+    int coroutine_fn (*bdrv_load_vmstate)(BlockDriverState *bs,
+                                          QEMUIOVector *qiov,
+                                          int64_t pos);
+
+    int (*bdrv_change_backing_file)(BlockDriverState *bs,
+        const char *backing_file, const char *backing_fmt);
+
+    /* removable device specific */
+    bool (*bdrv_is_inserted)(BlockDriverState *bs);
+    void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
+    void (*bdrv_lock_medium)(BlockDriverState *bs, bool locked);
+
+    /* to control generic scsi devices */
+    BlockAIOCB *(*bdrv_aio_ioctl)(BlockDriverState *bs,
+        unsigned long int req, void *buf,
+        BlockCompletionFunc *cb, void *opaque);
+    int coroutine_fn (*bdrv_co_ioctl)(BlockDriverState *bs,
+                                      unsigned long int req, void *buf);
+
+    /* List of options for creating images, terminated by name == NULL */
+    QemuOptsList *create_opts;
+
+    /* List of options for image amend */
+    QemuOptsList *amend_opts;
+
+    /*
+     * If this driver supports reopening images this contains a
+     * NULL-terminated list of the runtime options that can be
+     * modified. If an option in this list is unspecified during
+     * reopen then it _must_ be reset to its default value or return
+     * an error.
+     */
+    const char *const *mutable_opts;
+
+    /*
+     * Returns 0 for completed check, -errno for internal errors.
+     * The check results are stored in result.
+     */
+    int coroutine_fn (*bdrv_co_check)(BlockDriverState *bs,
+                                      BdrvCheckResult *result,
+                                      BdrvCheckMode fix);
+
+    void (*bdrv_debug_event)(BlockDriverState *bs, BlkdebugEvent event);
+
+    /* TODO Better pass an option string/QDict/QemuOpts to add any rule? */
+    int (*bdrv_debug_breakpoint)(BlockDriverState *bs, const char *event,
+        const char *tag);
+    int (*bdrv_debug_remove_breakpoint)(BlockDriverState *bs,
+        const char *tag);
+    int (*bdrv_debug_resume)(BlockDriverState *bs, const char *tag);
+    bool (*bdrv_debug_is_suspended)(BlockDriverState *bs, const char *tag);
+
+    void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);
+
+    /*
+     * Returns 1 if newly created images are guaranteed to contain only
+     * zeros, 0 otherwise.
+     */
+    int (*bdrv_has_zero_init)(BlockDriverState *bs);
+
+    /*
+     * Remove fd handlers, timers, and other event loop callbacks so the event
+     * loop is no longer in use.  Called with no in-flight requests and in
+     * depth-first traversal order with parents before child nodes.
+     */
+    void (*bdrv_detach_aio_context)(BlockDriverState *bs);
+
+    /*
+     * Add fd handlers, timers, and other event loop callbacks so I/O requests
+     * can be processed again.  Called with no in-flight requests and in
+     * depth-first traversal order with child nodes before parent nodes.
+     */
+    void (*bdrv_attach_aio_context)(BlockDriverState *bs,
+                                    AioContext *new_context);
+
+    /* io queue for linux-aio */
+    void (*bdrv_io_plug)(BlockDriverState *bs);
+    void (*bdrv_io_unplug)(BlockDriverState *bs);
+
+    /**
+     * Try to get @bs's logical and physical block size.
+     * On success, store them in @bsz and return zero.
+     * On failure, return negative errno.
+     */
+    /* I/O API, even though it jumps to the parent if it's a filter */
+    int (*bdrv_probe_blocksizes)(BlockDriverState *bs, BlockSizes *bsz);
+    /**
+     * Try to get @bs's geometry (cyls, heads, sectors)
+     * On success, store them in @geo and return 0.
+     * On failure return -errno.
+     * Only drivers that want to override guest geometry implement this
+     * callback; see hd_geometry_guess().
+     */
+    /* I/O API, even though it jumps to the parent if it's a filter */
+    int (*bdrv_probe_geometry)(BlockDriverState *bs, HDGeometry *geo);
+
+    /**
+     * bdrv_co_drain_begin is called if implemented in the beginning of a
+     * drain operation to drain and stop any internal sources of requests in
+     * the driver.
+     * bdrv_co_drain_end is called if implemented at the end of the drain.
+     *
+     * They should be used by the driver to e.g. manage scheduled I/O
+     * requests, or toggle an internal state. After the end of the drain new
+     * requests will continue normally.
+     */
+    void coroutine_fn (*bdrv_co_drain_begin)(BlockDriverState *bs);
+    void coroutine_fn (*bdrv_co_drain_end)(BlockDriverState *bs);
+
+    void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child,
+                           Error **errp);
+    void (*bdrv_del_child)(BlockDriverState *parent, BdrvChild *child,
+                           Error **errp);
+
+    /**
+     * Informs the block driver that a permission change is intended. The
+     * driver checks whether the change is permissible and may take other
+     * preparations for the change (e.g. get file system locks). This operation
+     * is always followed by a call to either .bdrv_set_perm or
+     * .bdrv_abort_perm_update.
+     *
+     * Checks whether the requested set of cumulative permissions in @perm
+     * can be granted for accessing @bs and whether no other users are using
+     * permissions other than those given in @shared (both arguments take
+     * BLK_PERM_* bitmasks).
+     *
+     * If both conditions are met, 0 is returned. Otherwise, -errno is returned
+     * and errp is set to an error describing the conflict.
+     */
+    int (*bdrv_check_perm)(BlockDriverState *bs, uint64_t perm,
+                           uint64_t shared, Error **errp);
+
+    /**
+     * Called to inform the driver that the cumulative set of used
+     * permissions for @bs has changed to @perm, and the set of sharable
+     * permissions to @shared. The driver can use this to propagate changes to
+     * its children (i.e. request permissions only if a parent actually needs
+     * them).
+     *
+     * This function is only invoked after bdrv_check_perm(), so block drivers
+     * may rely on preparations made in their .bdrv_check_perm implementation.
+     */
+    void (*bdrv_set_perm)(BlockDriverState *bs, uint64_t perm, uint64_t shared);
+
+    /*
+     * Called to inform the driver that after a previous bdrv_check_perm()
+     * call, the permission update is not performed and any preparations made
+     * for it (e.g. taken file locks) need to be undone.
+     *
+     * This function can be called even for nodes that never saw a
+     * bdrv_check_perm() call. It is a no-op then.
+     */
+    void (*bdrv_abort_perm_update)(BlockDriverState *bs);
+
+    /**
+     * Returns in @nperm and @nshared the permissions that the driver for @bs
+     * needs on its child @c, based on the cumulative permissions requested by
+     * the parents in @parent_perm and @parent_shared.
+     *
+     * If @c is NULL, return the permissions for attaching a new child for the
+     * given @child_class and @role.
+     *
+     * If @reopen_queue is non-NULL, don't return the currently needed
+     * permissions, but those that will be needed after applying the
+     * @reopen_queue.
+     */
+    void (*bdrv_child_perm)(BlockDriverState *bs, BdrvChild *c,
+                            BdrvChildRole role,
+                            BlockReopenQueue *reopen_queue,
+                            uint64_t parent_perm, uint64_t parent_shared,
+                            uint64_t *nperm, uint64_t *nshared);
+
+    bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs);
+    bool (*bdrv_co_can_store_new_dirty_bitmap)(BlockDriverState *bs,
+                                               const char *name,
+                                               uint32_t granularity,
+                                               Error **errp);
+    int (*bdrv_co_remove_persistent_dirty_bitmap)(BlockDriverState *bs,
+                                                  const char *name,
+                                                  Error **errp);
+
+    /**
+     * Register/unregister a buffer for I/O. For example, when the driver is
+     * interested to know the memory areas that will later be used in iovs, so
+     * that it can do IOMMU mapping with VFIO etc., in order to get better
+     * performance. In the case of VFIO drivers, this callback is used to do
+     * DMA mapping for hot buffers.
+     */
+    void (*bdrv_register_buf)(BlockDriverState *bs, void *host, size_t size);
+    void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host);
+    QLIST_ENTRY(BlockDriver) list;
+
+    /*
+     * Pointer to a NULL-terminated array of names of strong options
+     * that can be specified for bdrv_open(). A strong option is one
+     * that changes the data of a BDS.
+     * If this pointer is NULL, the array is considered empty.
+     * "filename" and "driver" are always considered strong.
+     */
+    const char *const *strong_runtime_opts;
+};
+
+static inline bool block_driver_can_compress(BlockDriver *drv)
+{
+    return drv->bdrv_co_pwritev_compressed ||
+           drv->bdrv_co_pwritev_compressed_part;
+}
+
+typedef struct BlockLimits {
+    /*
+     * Alignment requirement, in bytes, for offset/length of I/O
+     * requests. Must be a power of 2 less than INT_MAX; defaults to
+     * 1 for drivers with modern byte interfaces, and to 512
+     * otherwise.
+     */
+    uint32_t request_alignment;
+
+    /*
+     * Maximum number of bytes that can be discarded at once. Must be multiple
+     * of pdiscard_alignment, but need not be power of 2. May be 0 if no
+     * inherent 64-bit limit.
+     */
+    int64_t max_pdiscard;
+
+    /*
+     * Optimal alignment for discard requests in bytes. A power of 2
+     * is best but not mandatory.  Must be a multiple of
+     * bl.request_alignment, and must be less than max_pdiscard if
+     * that is set. May be 0 if bl.request_alignment is good enough
+     */
+    uint32_t pdiscard_alignment;
+
+    /*
+     * Maximum number of bytes that can be zeroized at once. Must be a
+     * multiple of pwrite_zeroes_alignment. 0 means no limit.
+     */
+    int64_t max_pwrite_zeroes;
+
+    /*
+     * Optimal alignment for write zeroes requests in bytes. A power
+     * of 2 is best but not mandatory.  Must be a multiple of
+     * bl.request_alignment, and must be less than max_pwrite_zeroes
+     * if that is set. May be 0 if bl.request_alignment is good
+     * enough
+     */
+    uint32_t pwrite_zeroes_alignment;
+
+    /*
+     * Optimal transfer length in bytes.  A power of 2 is best but not
+     * mandatory.  Must be a multiple of bl.request_alignment, or 0 if
+     * no preferred size
+     */
+    uint32_t opt_transfer;
+
+    /*
+     * Maximal transfer length in bytes.  Need not be power of 2, but
+     * must be multiple of opt_transfer and bl.request_alignment, or 0
+     * for no 32-bit limit.  For now, anything larger than INT_MAX is
+     * clamped down.
+     */
+    uint32_t max_transfer;
+
+    /*
+     * Maximal hardware transfer length in bytes.  Applies whenever
+     * transfers to the device bypass the kernel I/O scheduler, for
+     * example with SG_IO.  If larger than max_transfer or if zero,
+     * blk_get_max_hw_transfer will fall back to max_transfer.
+     */
+    uint64_t max_hw_transfer;
+
+    /*
+     * Maximal number of scatter/gather elements allowed by the hardware.
+     * Applies whenever transfers to the device bypass the kernel I/O
+     * scheduler, for example with SG_IO.  If larger than max_iov
+     * or if zero, blk_get_max_hw_iov will fall back to max_iov.
+     */
+    int max_hw_iov;
+
+
+    /* memory alignment, in bytes so that no bounce buffer is needed */
+    size_t min_mem_alignment;
+
+    /* memory alignment, in bytes, for bounce buffer */
+    size_t opt_mem_alignment;
+
+    /* maximum number of iovec elements */
+    int max_iov;
+} BlockLimits;
+
+typedef struct BdrvOpBlocker BdrvOpBlocker;
+
+typedef struct BdrvAioNotifier {
+    void (*attached_aio_context)(AioContext *new_context, void *opaque);
+    void (*detach_aio_context)(void *opaque);
+
+    void *opaque;
+    bool deleted;
+
+    QLIST_ENTRY(BdrvAioNotifier) list;
+} BdrvAioNotifier;
+
+struct BdrvChildClass {
+    /*
+     * If true, bdrv_replace_node() doesn't change the node this BdrvChild
+     * points to.
+     */
+    bool stay_at_node;
+
+    /*
+     * If true, the parent is a BlockDriverState and bdrv_next_all_states()
+     * will return it. This information is used for drain_all, where every node
+     * will be drained separately, so the drain only needs to be propagated to
+     * non-BDS parents.
+     */
+    bool parent_is_bds;
+
+    void (*inherit_options)(BdrvChildRole role, bool parent_is_format,
+                            int *child_flags, QDict *child_options,
+                            int parent_flags, QDict *parent_options);
+
+    void (*change_media)(BdrvChild *child, bool load);
+    void (*resize)(BdrvChild *child);
+
+    /*
+     * Returns a name that is supposedly more useful for human users than the
+     * node name for identifying the node in question (in particular, a BB
+     * name), or NULL if the parent can't provide a better name.
+     */
+    const char *(*get_name)(BdrvChild *child);
+
+    /*
+     * Returns a malloced string that describes the parent of the child for a
+     * human reader. This could be a node-name, BlockBackend name, qdev ID or
+     * QOM path of the device owning the BlockBackend, job type and ID etc. The
+     * caller is responsible for freeing the memory.
+     */
+    char *(*get_parent_desc)(BdrvChild *child);
+
+    /*
+     * If this pair of functions is implemented, the parent doesn't issue new
+     * requests after returning from .drained_begin() until .drained_end() is
+     * called.
+     *
+     * These functions must not change the graph (and therefore also must not
+     * call aio_poll(), which could change the graph indirectly).
+     *
+     * If drained_end() schedules background operations, it must atomically
+     * increment *drained_end_counter for each such operation and atomically
+     * decrement it once the operation has settled.
+     *
+     * Note that this can be nested. If drained_begin() was called twice, new
+     * I/O is allowed only after drained_end() was called twice, too.
+     */
+    void (*drained_begin)(BdrvChild *child);
+    void (*drained_end)(BdrvChild *child, int *drained_end_counter);
+
+    /*
+     * Returns whether the parent has pending requests for the child. This
+     * callback is polled after .drained_begin() has been called until all
+     * activity on the child has stopped.
+     */
+    bool (*drained_poll)(BdrvChild *child);
+
+    /*
+     * Notifies the parent that the child has been activated/inactivated (e.g.
+     * when migration is completing) and it can start/stop requesting
+     * permissions and doing I/O on it.
+     */
+    void (*activate)(BdrvChild *child, Error **errp);
+    int (*inactivate)(BdrvChild *child);
+
+    void (*attach)(BdrvChild *child);
+    void (*detach)(BdrvChild *child);
+
+    /*
+     * Notifies the parent that the filename of its child has changed (e.g.
+     * because the direct child was removed from the backing chain), so that it
+     * can update its reference.
+     */
+    int (*update_filename)(BdrvChild *child, BlockDriverState *new_base,
+                           const char *filename, Error **errp);
+
+    bool (*can_set_aio_ctx)(BdrvChild *child, AioContext *ctx,
+                            GSList **ignore, Error **errp);
+    void (*set_aio_ctx)(BdrvChild *child, AioContext *ctx, GSList **ignore);
+
+    AioContext *(*get_parent_aio_context)(BdrvChild *child);
+};
+
+extern const BdrvChildClass child_of_bds;
+
+struct BdrvChild {
+    BlockDriverState *bs;
+    char *name;
+    const BdrvChildClass *klass;
+    BdrvChildRole role;
+    void *opaque;
+
+    /**
+     * Granted permissions for operating on this BdrvChild (BLK_PERM_* bitmask)
+     */
+    uint64_t perm;
+
+    /**
+     * Permissions that can still be granted to other users of @bs while this
+     * BdrvChild is still attached to it. (BLK_PERM_* bitmask)
+     */
+    uint64_t shared_perm;
+
+    /*
+     * This link is frozen: the child can neither be replaced nor
+     * detached from the parent.
+     */
+    bool frozen;
+
+    /*
+     * How many times the parent of this child has been drained
+     * (through klass->drained_*).
+     * Usually, this is equal to bs->quiesce_counter (potentially
+     * reduced by bdrv_drain_all_count).  It may differ while the
+     * child is entering or leaving a drained section.
+     */
+    int parent_quiesce_counter;
+
+    QLIST_ENTRY(BdrvChild) next;
+    QLIST_ENTRY(BdrvChild) next_parent;
+};
+
+/*
+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @valid: Whether the cache is valid (should be accessed with atomic
+ *         functions so this can be reset by RCU readers)
+ * @data_start: Offset where we know (or strongly assume) there is data
+ * @data_end: Offset where the data region ends (which is not necessarily
+ *            the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+    struct rcu_head rcu;
+
+    bool valid;
+    int64_t data_start;
+    int64_t data_end;
+} BdrvBlockStatusCache;
+
+struct BlockDriverState {
+    /*
+     * Protected by big QEMU lock or read-only after opening.  No special
+     * locking needed during I/O...
+     */
+    int open_flags; /* flags used to open the file, re-used for re-open */
+    bool encrypted; /* if true, the media is encrypted */
+    bool sg;        /* if true, the device is a /dev/sg* */
+    bool probed;    /* if true, format was probed rather than specified */
+    bool force_share; /* if true, always allow all shared permissions */
+    bool implicit;  /* if true, this filter node was automatically inserted */
+
+    BlockDriver *drv; /* NULL means no media */
+    void *opaque;
+
+    AioContext *aio_context; /* event loop used for fd handlers, timers, etc */
+    /*
+     * long-running tasks intended to always use the same AioContext as this
+     * BDS may register themselves in this list to be notified of changes
+     * regarding this BDS's context
+     */
+    QLIST_HEAD(, BdrvAioNotifier) aio_notifiers;
+    bool walking_aio_notifiers; /* to make removal during iteration safe */
+
+    char filename[PATH_MAX];
+    /*
+     * If not empty, this image is a diff in relation to backing_file.
+     * Note that this is the name given in the image header and
+     * therefore may or may not be equal to .backing->bs->filename.
+     * If this field contains a relative path, it is to be resolved
+     * relatively to the overlay's location.
+     */
+    char backing_file[PATH_MAX];
+    /*
+     * The backing filename indicated by the image header.  Contrary
+     * to backing_file, if we ever open this file, auto_backing_file
+     * is replaced by the resulting BDS's filename (i.e. after a
+     * bdrv_refresh_filename() run).
+     */
+    char auto_backing_file[PATH_MAX];
+    char backing_format[16]; /* if non-zero and backing_file exists */
+
+    QDict *full_open_options;
+    char exact_filename[PATH_MAX];
+
+    BdrvChild *backing;
+    BdrvChild *file;
+
+    /* I/O Limits */
+    BlockLimits bl;
+
+    /*
+     * Flags honored during pread
+     */
+    unsigned int supported_read_flags;
+    /*
+     * Flags honored during pwrite (so far: BDRV_REQ_FUA,
+     * BDRV_REQ_WRITE_UNCHANGED).
+     * If a driver does not support BDRV_REQ_WRITE_UNCHANGED, those
+     * writes will be issued as normal writes without the flag set.
+     * This is important to note for drivers that do not explicitly
+     * request a WRITE permission for their children and instead take
+     * the same permissions as their parent did (this is commonly what
+     * block filters do).  Such drivers have to be aware that the
+     * parent may have taken a WRITE_UNCHANGED permission only and is
+     * issuing such requests.  Drivers either must make sure that
+     * these requests do not result in plain WRITE accesses (usually
+     * by supporting BDRV_REQ_WRITE_UNCHANGED, and then forwarding
+     * every incoming write request as-is, including potentially that
+     * flag), or they have to explicitly take the WRITE permission for
+     * their children.
+     */
+    unsigned int supported_write_flags;
+    /*
+     * Flags honored during pwrite_zeroes (so far: BDRV_REQ_FUA,
+     * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED)
+     */
+    unsigned int supported_zero_flags;
+    /*
+     * Flags honoured during truncate (so far: BDRV_REQ_ZERO_WRITE).
+     *
+     * If BDRV_REQ_ZERO_WRITE is given, the truncate operation must make sure
+     * that any added space reads as all zeros. If this can't be guaranteed,
+     * the operation must fail.
+     */
+    unsigned int supported_truncate_flags;
+
+    /* the following member gives a name to every node on the bs graph. */
+    char node_name[32];
+    /* element of the list of named nodes building the graph */
+    QTAILQ_ENTRY(BlockDriverState) node_list;
+    /* element of the list of all BlockDriverStates (all_bdrv_states) */
+    QTAILQ_ENTRY(BlockDriverState) bs_list;
+    /* element of the list of monitor-owned BDS */
+    QTAILQ_ENTRY(BlockDriverState) monitor_list;
+    int refcnt;
+
+    /* operation blockers. Protected by BQL. */
+    QLIST_HEAD(, BdrvOpBlocker) op_blockers[BLOCK_OP_TYPE_MAX];
+
+    /*
+     * The node that this node inherited default options from (and a reopen on
+     * which can affect this node by changing these defaults). This is always a
+     * parent node of this node.
+     */
+    BlockDriverState *inherits_from;
+    QLIST_HEAD(, BdrvChild) children;
+    QLIST_HEAD(, BdrvChild) parents;
+
+    QDict *options;
+    QDict *explicit_options;
+    BlockdevDetectZeroesOptions detect_zeroes;
+
+    /* The error object in use for blocking operations on backing_hd */
+    Error *backing_blocker;
+
+    /* Protected by AioContext lock */
+
+    /*
+     * If we are reading a disk image, give its size in sectors.
+     * Generally read-only; it is written to by load_snapshot and
+     * save_snapshot, but the block layer is quiescent during those.
+     */
+    int64_t total_sectors;
+
+    /* threshold limit for writes, in bytes. "High water mark". */
+    uint64_t write_threshold_offset;
+
+    /*
+     * Writing to the list requires the BQL _and_ the dirty_bitmap_mutex.
+     * Reading from the list can be done with either the BQL or the
+     * dirty_bitmap_mutex.  Modifying a bitmap only requires
+     * dirty_bitmap_mutex.
+     */
+    QemuMutex dirty_bitmap_mutex;
+    QLIST_HEAD(, BdrvDirtyBitmap) dirty_bitmaps;
+
+    /* Offset after the highest byte written to */
+    Stat64 wr_highest_offset;
+
+    /*
+     * If true, copy read backing sectors into image.  Can be >1 if more
+     * than one client has requested copy-on-read.  Accessed with atomic
+     * ops.
+     */
+    int copy_on_read;
+
+    /*
+     * number of in-flight requests; overall and serialising.
+     * Accessed with atomic ops.
+     */
+    unsigned int in_flight;
+    unsigned int serialising_in_flight;
+
+    /*
+     * counter for nested bdrv_io_plug.
+     * Accessed with atomic ops.
+     */
+    unsigned io_plugged;
+
+    /* do we need to tell the guest if we have a volatile write cache? */
+    int enable_write_cache;
+
+    /* Accessed with atomic ops.  */
+    int quiesce_counter;
+    int recursive_quiesce_counter;
+
+    unsigned int write_gen;               /* Current data generation */
+
+    /* Protected by reqs_lock.  */
+    CoMutex reqs_lock;
+    QLIST_HEAD(, BdrvTrackedRequest) tracked_requests;
+    CoQueue flush_queue;                  /* Serializing flush queue */
+    bool active_flush_req;                /* Flush request in flight? */
+
+    /* Only read/written by whoever has set active_flush_req to true.  */
+    unsigned int flushed_gen;             /* Flushed write generation */
+
+    /* BdrvChild links to this node may never be frozen */
+    bool never_freeze;
+
+    /* Lock for block-status cache RCU writers */
+    CoMutex bsc_modify_lock;
+    /* Always non-NULL, but must only be dereferenced under an RCU read guard */
+    BdrvBlockStatusCache *block_status_cache;
+};
+
+struct BlockBackendRootState {
+    int open_flags;
+    BlockdevDetectZeroesOptions detect_zeroes;
+};
+
+typedef enum BlockMirrorBackingMode {
+    /*
+     * Reuse the existing backing chain from the source for the target.
+     * - sync=full: Set backing BDS to NULL.
+     * - sync=top:  Use source's backing BDS.
+     * - sync=none: Use source as the backing BDS.
+     */
+    MIRROR_SOURCE_BACKING_CHAIN,
+
+    /* Open the target's backing chain completely anew */
+    MIRROR_OPEN_BACKING_CHAIN,
+
+    /* Do not change the target's backing BDS after job completion */
+    MIRROR_LEAVE_BACKING_CHAIN,
+} BlockMirrorBackingMode;
+
+
+/*
+ * Essential block drivers which must always be statically linked into qemu, and
+ * which therefore can be accessed without using bdrv_find_format()
+ */
+extern BlockDriver bdrv_file;
+extern BlockDriver bdrv_raw;
+extern BlockDriver bdrv_qcow2;
+
+extern unsigned int bdrv_drain_all_count;
+extern QemuOptsList bdrv_create_opts_simple;
+
+/* Common functions that are neither I/O nor Global State */
+
+static inline BlockDriverState *child_bs(BdrvChild *child)
+{
+    return child ? child->bs : NULL;
+}
+
+int bdrv_check_request(int64_t offset, int64_t bytes, Error **errp);
+int get_tmp_filename(char *filename, int size);
+void bdrv_parse_filename_strip_prefix(const char *filename, const char *prefix,
+                                      QDict *options);
+
+bool bdrv_backing_overridden(BlockDriverState *bs);
+
+int bdrv_check_qiov_request(int64_t offset, int64_t bytes,
+                            QEMUIOVector *qiov, size_t qiov_offset,
+                            Error **errp);
+
+#ifdef _WIN32
+int is_windows_drive(const char *filename);
+#endif
+
+#endif /* BLOCK_INT_COMMON_H */
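
The nesting rule documented for .drained_begin()/.drained_end() in
BdrvChildClass (new I/O is allowed only once every drained_begin() has been
matched by a drained_end()) can be illustrated with a small standalone model.
This is only a sketch: ToyParent and the toy_* functions are hypothetical
names invented for the illustration and are not part of QEMU; the real
callbacks operate on BdrvChild and involve polling and atomics that are
omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a parent of a BdrvChild; not a QEMU type. */
typedef struct ToyParent {
    int quiesce_counter;   /* nesting depth of drained sections */
    int submitted;         /* number of requests the parent issued */
} ToyParent;

/* Enter a drained section; sections may nest. */
static void toy_drained_begin(ToyParent *p)
{
    p->quiesce_counter++;
}

/* Leave one drained section; must pair with an earlier begin. */
static void toy_drained_end(ToyParent *p)
{
    assert(p->quiesce_counter > 0);
    p->quiesce_counter--;
}

/*
 * Try to issue a request. Returns false while any drained section is
 * still open, mirroring the "no new requests" rule in the comment above.
 */
static bool toy_submit(ToyParent *p)
{
    if (p->quiesce_counter > 0) {
        return false;
    }
    p->submitted++;
    return true;
}
```

In this model, two calls to toy_drained_begin() require two calls to
toy_drained_end() before toy_submit() succeeds again.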
diff --git a/include/block/block_int-global-state.h b/include/block/block_int-global-state.h
new file mode 100644
index 0000000000..d08e80222c
--- /dev/null
+++ b/include/block/block_int-global-state.h
@@ -0,0 +1,319 @@
+/*
+ * QEMU System Emulator block driver
+ *
+ * Copyright (c) 2003 Fabrice Bellard
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#ifndef BLOCK_INT_GLOBAL_STATE_H
+#define BLOCK_INT_GLOBAL_STATE_H
+
+#include "block_int-common.h"
+
+/*
+ * Global state (GS) API. These functions run under the BQL.
+ *
+ * See include/block/block-global-state.h for more information about
+ * the GS API.
+ */
+
+void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent);
+void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent);
+
+BlockDriver *bdrv_probe_all(const uint8_t *buf, int buf_size,
+                            const char *filename);
+
+/**
+ * stream_start:
+ * @job_id: The id of the newly-created job, or %NULL to use the
+ * device name of @bs.
+ * @bs: Block device to operate on.
+ * @base: Block device that will become the new base, or %NULL to
+ * flatten the whole backing file chain onto @bs.
+ * @backing_file_str: The file name that will be written to @bs as the
+ * new backing file if the job completes. Ignored if @base is %NULL.
+ * @creation_flags: Flags that control the behavior of the Job lifetime.
+ *                  See @BlockJobCreateFlags
+ * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
+ * @on_error: The action to take upon error.
+ * @filter_node_name: The node name that should be assigned to the filter
+ *                    driver that the stream job inserts into the graph above
+ *                    @bs. NULL means that a node name should be autogenerated.
+ * @errp: Error object.
+ *
+ * Start a streaming operation on @bs.  Clusters that are unallocated
+ * in @bs, but allocated in any image between @base and @bs (both
+ * exclusive) will be written to @bs.  At the end of a successful
+ * streaming job, the backing file of @bs will be changed to
+ * @backing_file_str in the written image and to @base in the live
+ * BlockDriverState.
+ */
+void stream_start(const char *job_id, BlockDriverState *bs,
+                  BlockDriverState *base, const char *backing_file_str,
+                  BlockDriverState *bottom,
+                  int creation_flags, int64_t speed,
+                  BlockdevOnError on_error,
+                  const char *filter_node_name,
+                  Error **errp);
+
+/**
+ * commit_start:
+ * @job_id: The id of the newly-created job, or %NULL to use the
+ * device name of @bs.
+ * @bs: Active block device.
+ * @top: Top block device to be committed.
+ * @base: Block device that will be written into, and become the new top.
+ * @creation_flags: Flags that control the behavior of the Job lifetime.
+ *                  See @BlockJobCreateFlags
+ * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
+ * @on_error: The action to take upon error.
+ * @backing_file_str: String to use as the backing file in @top's overlay
+ * @filter_node_name: The node name that should be assigned to the filter
+ * driver that the commit job inserts into the graph above @top. NULL means
+ * that a node name should be autogenerated.
+ * @errp: Error object.
+ *
+ */
+void commit_start(const char *job_id, BlockDriverState *bs,
+                  BlockDriverState *base, BlockDriverState *top,
+                  int creation_flags, int64_t speed,
+                  BlockdevOnError on_error, const char *backing_file_str,
+                  const char *filter_node_name, Error **errp);
+/**
+ * commit_active_start:
+ * @job_id: The id of the newly-created job, or %NULL to use the
+ * device name of @bs.
+ * @bs: Active block device to be committed.
+ * @base: Block device that will be written into, and become the new top.
+ * @creation_flags: Flags that control the behavior of the Job lifetime.
+ *                  See @BlockJobCreateFlags
+ * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
+ * @on_error: The action to take upon error.
+ * @filter_node_name: The node name that should be assigned to the filter
+ * driver that the commit job inserts into the graph above @bs. NULL means that
+ * a node name should be autogenerated.
+ * @cb: Completion function for the job.
+ * @opaque: Opaque pointer value passed to @cb.
+ * @auto_complete: Auto complete the job.
+ * @errp: Error object.
+ *
+ */
+BlockJob *commit_active_start(const char *job_id, BlockDriverState *bs,
+                              BlockDriverState *base, int creation_flags,
+                              int64_t speed, BlockdevOnError on_error,
+                              const char *filter_node_name,
+                              BlockCompletionFunc *cb, void *opaque,
+                              bool auto_complete, Error **errp);
+/**
+ * mirror_start:
+ * @job_id: The id of the newly-created job, or %NULL to use the
+ * device name of @bs.
+ * @bs: Block device to operate on.
+ * @target: Block device to write to.
+ * @replaces: Block graph node name to replace once the mirror is done. Can
+ *            only be used when full mirroring is selected.
+ * @creation_flags: Flags that control the behavior of the Job lifetime.
+ *                  See @BlockJobCreateFlags
+ * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
+ * @granularity: The chosen granularity for the dirty bitmap.
+ * @buf_size: The amount of data that can be in flight at one time.
+ * @mode: Whether to collapse all images in the chain to the target.
+ * @backing_mode: How to establish the target's backing chain after completion.
+ * @zero_target: Whether the target should be explicitly zero-initialized
+ * @on_source_error: The action to take upon error reading from the source.
+ * @on_target_error: The action to take upon error writing to the target.
+ * @unmap: Whether to unmap target where source sectors only contain zeroes.
+ * @filter_node_name: The node name that should be assigned to the filter
+ * driver that the mirror job inserts into the graph above @bs. NULL means that
+ * a node name should be autogenerated.
+ * @copy_mode: When to trigger writes to the target.
+ * @errp: Error object.
+ *
+ * Start a mirroring operation on @bs.  Clusters that are allocated
+ * in @bs will be written to @target until the job is cancelled or
+ * manually completed.  At the end of a successful mirroring job,
+ * @bs will be switched to read from @target.
+ */
+void mirror_start(const char *job_id, BlockDriverState *bs,
+                  BlockDriverState *target, const char *replaces,
+                  int creation_flags, int64_t speed,
+                  uint32_t granularity, int64_t buf_size,
+                  MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
+                  bool zero_target,
+                  BlockdevOnError on_source_error,
+                  BlockdevOnError on_target_error,
+                  bool unmap, const char *filter_node_name,
+                  MirrorCopyMode copy_mode, Error **errp);
+
+/**
+ * backup_job_create:
+ * @job_id: The id of the newly-created job, or %NULL to use the
+ * device name of @bs.
+ * @bs: Block device to operate on.
+ * @target: Block device to write to.
+ * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
+ * @sync_mode: What parts of the disk image should be copied to the destination.
+ * @sync_bitmap: The dirty bitmap if sync_mode is 'bitmap' or 'incremental'
+ * @bitmap_mode: The bitmap synchronization policy to use.
+ * @perf: Performance options. All actual fields assumed to be present,
+ *        all ".has_*" fields are ignored.
+ * @on_source_error: The action to take upon error reading from the source.
+ * @on_target_error: The action to take upon error writing to the target.
+ * @creation_flags: Flags that control the behavior of the Job lifetime.
+ *                  See @BlockJobCreateFlags
+ * @cb: Completion function for the job.
+ * @opaque: Opaque pointer value passed to @cb.
+ * @txn: Transaction that this job is part of (may be NULL).
+ *
+ * Create a backup operation on @bs.  Clusters in @bs are written to @target
+ * until the job is cancelled or manually completed.
+ */
+BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
+                            BlockDriverState *target, int64_t speed,
+                            MirrorSyncMode sync_mode,
+                            BdrvDirtyBitmap *sync_bitmap,
+                            BitmapSyncMode bitmap_mode,
+                            bool compress,
+                            const char *filter_node_name,
+                            BackupPerf *perf,
+                            BlockdevOnError on_source_error,
+                            BlockdevOnError on_target_error,
+                            int creation_flags,
+                            BlockCompletionFunc *cb, void *opaque,
+                            JobTxn *txn, Error **errp);
+
+BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
+                                  const char *child_name,
+                                  const BdrvChildClass *child_class,
+                                  BdrvChildRole child_role,
+                                  uint64_t perm, uint64_t shared_perm,
+                                  void *opaque, Error **errp);
+void bdrv_root_unref_child(BdrvChild *child);
+
+void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
+                              uint64_t *shared_perm);
+
+/**
+ * Sets a BdrvChild's permissions.  Avoid if the parent is a BDS; use
+ * bdrv_child_refresh_perms() instead and make the parent's
+ * .bdrv_child_perm() implementation return the correct values.
+ */
+int bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared,
+                            Error **errp);
+
+/**
+ * Calls bs->drv->bdrv_child_perm() and updates the child's permission
+ * masks with the result.
+ * Drivers should invoke this function whenever an event occurs that
+ * makes their .bdrv_child_perm() implementation return different
+ * values than before, but which will not result in the block layer
+ * automatically refreshing the permissions.
+ */
+int bdrv_child_refresh_perms(BlockDriverState *bs, BdrvChild *c, Error **errp);
+
+bool bdrv_recurse_can_replace(BlockDriverState *bs,
+                              BlockDriverState *to_replace);
+
+/*
+ * Default implementation for BlockDriver.bdrv_child_perm() that can
+ * be used by block filters and image formats, as long as they use the
+ * child_of_bds child class and set an appropriate BdrvChildRole.
+ */
+void bdrv_default_perms(BlockDriverState *bs, BdrvChild *c,
+                        BdrvChildRole role, BlockReopenQueue *reopen_queue,
+                        uint64_t perm, uint64_t shared,
+                        uint64_t *nperm, uint64_t *nshared);
+
+const char *bdrv_get_parent_name(const BlockDriverState *bs);
+void blk_dev_change_media_cb(BlockBackend *blk, bool load, Error **errp);
+bool blk_dev_has_removable_media(BlockBackend *blk);
+void blk_dev_eject_request(BlockBackend *blk, bool force);
+bool blk_dev_is_medium_locked(BlockBackend *blk);
+
+void bdrv_restore_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap *backup);
+
+void bdrv_set_monitor_owned(BlockDriverState *bs);
+
+void blockdev_close_all_bdrv_states(void);
+
+BlockDriverState *bds_tree_init(QDict *bs_opts, Error **errp);
+
+/**
+ * Simple implementation of bdrv_co_create_opts for protocol drivers
+ * which only support creation via opening a file
+ * (usually existing raw storage device)
+ */
+int coroutine_fn bdrv_co_create_opts_simple(BlockDriver *drv,
+                                            const char *filename,
+                                            QemuOpts *opts,
+                                            Error **errp);
+
+BdrvDirtyBitmap *block_dirty_bitmap_lookup(const char *node,
+                                           const char *name,
+                                           BlockDriverState **pbs,
+                                           Error **errp);
+BdrvDirtyBitmap *block_dirty_bitmap_merge(const char *node, const char *target,
+                                          BlockDirtyBitmapMergeSourceList *bms,
+                                          HBitmap **backup, Error **errp);
+BdrvDirtyBitmap *block_dirty_bitmap_remove(const char *node, const char *name,
+                                           bool release,
+                                           BlockDriverState **bitmap_bs,
+                                           Error **errp);
+
+
+BlockDriverState *bdrv_skip_implicit_filters(BlockDriverState *bs);
+
+/**
+ * bdrv_add_aio_context_notifier:
+ *
+ * If a long-running job intends to be always run in the same AioContext as a
+ * certain BDS, it may use this function to be notified of changes regarding the
+ * association of the BDS to an AioContext.
+ *
+ * attached_aio_context() is called after the target BDS has been attached to a
+ * new AioContext; detach_aio_context() is called before the target BDS is being
+ * detached from its old AioContext.
+ */
+void bdrv_add_aio_context_notifier(BlockDriverState *bs,
+        void (*attached_aio_context)(AioContext *new_context, void *opaque),
+        void (*detach_aio_context)(void *opaque), void *opaque);
+
+/**
+ * bdrv_remove_aio_context_notifier:
+ *
+ * Unsubscribe from change notifications regarding the BDS's AioContext. The
+ * parameters given here have to be the same as those given to
+ * bdrv_add_aio_context_notifier().
+ */
+void bdrv_remove_aio_context_notifier(BlockDriverState *bs,
+                                      void (*aio_context_attached)(AioContext *,
+                                                                   void *),
+                                      void (*aio_context_detached)(void *),
+                                      void *opaque);
+
+/**
+ * End all quiescent sections started by bdrv_drain_all_begin(). This is
+ * needed when deleting a BDS before bdrv_drain_all_end() is called.
+ *
+ * NOTE: this is an internal helper for bdrv_close() *only*. No one else
+ * should call it.
+ */
+void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
+
+#endif /* BLOCK_INT_GLOBAL_STATE_H */
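
The contract described for bdrv_add_aio_context_notifier() and
bdrv_remove_aio_context_notifier() (detach is called before the node leaves
its old context, attached is called after it joins the new one, and removal
needs the same arguments that were used for registration) can be sketched
with a standalone model. Everything named Toy*/toy_* below is hypothetical
and exists only for this illustration; the context is modelled as a plain
void pointer, and the fixed-size array stands in for the QLIST that the
BlockDriverState actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NOTIFIERS 8

typedef void ToyAttachedFn(void *new_context, void *opaque);
typedef void ToyDetachFn(void *opaque);

typedef struct ToyNotifier {
    ToyAttachedFn *attached;
    ToyDetachFn *detach;
    void *opaque;
    bool in_use;
} ToyNotifier;

/* Hypothetical stand-in for a node that owns a context. */
typedef struct ToyNode {
    void *context;
    ToyNotifier notifiers[TOY_MAX_NOTIFIERS];
} ToyNode;

static bool toy_add_notifier(ToyNode *n, ToyAttachedFn *attached,
                             ToyDetachFn *detach, void *opaque)
{
    for (size_t i = 0; i < TOY_MAX_NOTIFIERS; i++) {
        if (!n->notifiers[i].in_use) {
            n->notifiers[i] = (ToyNotifier){ attached, detach, opaque, true };
            return true;
        }
    }
    return false; /* table full */
}

/* Removal only matches an entry registered with identical arguments. */
static bool toy_remove_notifier(ToyNode *n, ToyAttachedFn *attached,
                                ToyDetachFn *detach, void *opaque)
{
    for (size_t i = 0; i < TOY_MAX_NOTIFIERS; i++) {
        ToyNotifier *nt = &n->notifiers[i];
        if (nt->in_use && nt->attached == attached &&
            nt->detach == detach && nt->opaque == opaque) {
            nt->in_use = false;
            return true;
        }
    }
    return false;
}

/* Switch context: notify detach first, then attach to the new context. */
static void toy_set_context(ToyNode *n, void *new_context)
{
    for (size_t i = 0; i < TOY_MAX_NOTIFIERS; i++) {
        if (n->notifiers[i].in_use) {
            n->notifiers[i].detach(n->notifiers[i].opaque);
        }
    }
    n->context = new_context;
    for (size_t i = 0; i < TOY_MAX_NOTIFIERS; i++) {
        if (n->notifiers[i].in_use) {
            n->notifiers[i].attached(new_context, n->notifiers[i].opaque);
        }
    }
}

/* Example subscriber: a job that tracks which context it runs in. */
typedef struct ToyJob {
    int attached;
    int detached;
    void *context;
} ToyJob;

static void toy_job_attached(void *new_context, void *opaque)
{
    ToyJob *job = opaque;
    job->attached++;
    job->context = new_context;
}

static void toy_job_detach(void *opaque)
{
    ToyJob *job = opaque;
    job->detached++;
    job->context = NULL;
}
```

A job would register toy_job_attached/toy_job_detach with itself as the
opaque pointer, and later unregister with exactly the same triple.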
diff --git a/include/block/block_int-io.h b/include/block/block_int-io.h
new file mode 100644
index 0000000000..f74eb46ff8
--- /dev/null
+++ b/include/block/block_int-io.h
@@ -0,0 +1,163 @@
+/*
+ * QEMU System Emulator block driver
+ *
+ * Copyright (c) 2003 Fabrice Bellard
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#ifndef BLOCK_INT_IO_H
+#define BLOCK_INT_IO_H
+
+#include "block_int-common.h"
+
+/*
+ * I/O API functions. These functions are thread-safe.
+ *
+ * See include/block/block-io.h for more information about
+ * the I/O API.
+ */
+
+int coroutine_fn bdrv_co_preadv(BdrvChild *child,
+    int64_t offset, int64_t bytes, QEMUIOVector *qiov,
+    BdrvRequestFlags flags);
+int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
+    int64_t offset, int64_t bytes,
+    QEMUIOVector *qiov, size_t qiov_offset, BdrvRequestFlags flags);
+int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
+    int64_t offset, int64_t bytes, QEMUIOVector *qiov,
+    BdrvRequestFlags flags);
+int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
+    int64_t offset, int64_t bytes,
+    QEMUIOVector *qiov, size_t qiov_offset, BdrvRequestFlags flags);
+
+static inline int coroutine_fn bdrv_co_pread(BdrvChild *child,
+    int64_t offset, unsigned int bytes, void *buf, BdrvRequestFlags flags)
+{
+    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
+
+    return bdrv_co_preadv(child, offset, bytes, &qiov, flags);
+}
+
+static inline int coroutine_fn bdrv_co_pwrite(BdrvChild *child,
+    int64_t offset, unsigned int bytes, void *buf, BdrvRequestFlags flags)
+{
+    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
+
+    return bdrv_co_pwritev(child, offset, bytes, &qiov, flags);
+}
+
+bool coroutine_fn bdrv_make_request_serialising(BdrvTrackedRequest *req,
+                                                uint64_t align);
+BdrvTrackedRequest *coroutine_fn bdrv_co_get_self_request(BlockDriverState *bs);
+
+/**
+ * bdrv_wakeup:
+ * @bs: The BlockDriverState for which an I/O operation has been completed.
+ *
+ * Wake up the main thread if it is waiting on BDRV_POLL_WHILE.  During
+ * synchronous I/O on a BlockDriverState that is attached to another
+ * I/O thread, the main thread lets the I/O thread's event loop run,
+ * waiting for the I/O operation to complete.  A bdrv_wakeup will wake
+ * up the main thread if necessary.
+ *
+ * Manual calls to bdrv_wakeup are rarely necessary, because
+ * bdrv_dec_in_flight already calls it.
+ */
+void bdrv_wakeup(BlockDriverState *bs);
+
+bool blk_dev_has_tray(BlockBackend *blk);
+bool blk_dev_is_tray_open(BlockBackend *blk);
+
+void bdrv_set_dirty(BlockDriverState *bs, int64_t offset, int64_t bytes);
+
+void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out);
+bool bdrv_dirty_bitmap_merge_internal(BdrvDirtyBitmap *dest,
+                                      const BdrvDirtyBitmap *src,
+                                      HBitmap **backup, bool lock);
+
+void bdrv_inc_in_flight(BlockDriverState *bs);
+void bdrv_dec_in_flight(BlockDriverState *bs);
+
+int coroutine_fn bdrv_co_copy_range_from(BdrvChild *src, int64_t src_offset,
+                                         BdrvChild *dst, int64_t dst_offset,
+                                         int64_t bytes,
+                                         BdrvRequestFlags read_flags,
+                                         BdrvRequestFlags write_flags);
+int coroutine_fn bdrv_co_copy_range_to(BdrvChild *src, int64_t src_offset,
+                                       BdrvChild *dst, int64_t dst_offset,
+                                       int64_t bytes,
+                                       BdrvRequestFlags read_flags,
+                                       BdrvRequestFlags write_flags);
+
+int refresh_total_sectors(BlockDriverState *bs, int64_t hint);
+
+BdrvChild *bdrv_cow_child(BlockDriverState *bs);
+BdrvChild *bdrv_filter_child(BlockDriverState *bs);
+BdrvChild *bdrv_filter_or_cow_child(BlockDriverState *bs);
+BdrvChild *bdrv_primary_child(BlockDriverState *bs);
+BlockDriverState *bdrv_skip_filters(BlockDriverState *bs);
+BlockDriverState *bdrv_backing_chain_next(BlockDriverState *bs);
+
+static inline BlockDriverState *bdrv_cow_bs(BlockDriverState *bs)
+{
+    return child_bs(bdrv_cow_child(bs));
+}
+
+static inline BlockDriverState *bdrv_filter_bs(BlockDriverState *bs)
+{
+    return child_bs(bdrv_filter_child(bs));
+}
+
+static inline BlockDriverState *bdrv_filter_or_cow_bs(BlockDriverState *bs)
+{
+    return child_bs(bdrv_filter_or_cow_child(bs));
+}
+
+static inline BlockDriverState *bdrv_primary_bs(BlockDriverState *bs)
+{
+    return child_bs(bdrv_primary_child(bs));
+}
+
+/**
+ * Check whether the given offset is in the cached block-status data
+ * region.
+ *
+ * If it is, and @pnum is not NULL, *pnum is set to
+ * `bsc.data_end - offset`, i.e. how many bytes, starting from
+ * @offset, are data (according to the cache).
+ * Otherwise, *pnum is not touched.
+ */
+bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum);
+
+/**
+ * If [offset, offset + bytes) overlaps with the currently cached
+ * block-status region, invalidate the cache.
+ *
+ * (To be used by I/O paths that cause data regions to be zero or
+ * holes.)
+ */
+void bdrv_bsc_invalidate_range(BlockDriverState *bs,
+                               int64_t offset, int64_t bytes);
+
+/**
+ * Mark the range [offset, offset + bytes) as a data region.
+ */
+void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes);
+
+#endif /* BLOCK_INT_IO_H */
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f4c75e8ba9..7d50b6bbd1 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -24,1481 +24,9 @@
 #ifndef BLOCK_INT_H
 #define BLOCK_INT_H
 
-#include "block/accounting.h"
-#include "block/block.h"
-#include "block/aio-wait.h"
-#include "qemu/queue.h"
-#include "qemu/coroutine.h"
-#include "qemu/stats64.h"
-#include "qemu/timer.h"
-#include "qemu/hbitmap.h"
-#include "block/snapshot.h"
-#include "qemu/throttle.h"
-#include "qemu/rcu.h"
+#include "block_int-global-state.h"
+#include "block_int-io.h"
 
-#define BLOCK_FLAG_LAZY_REFCOUNTS   8
-
-#define BLOCK_OPT_SIZE              "size"
-#define BLOCK_OPT_ENCRYPT           "encryption"
-#define BLOCK_OPT_ENCRYPT_FORMAT    "encrypt.format"
-#define BLOCK_OPT_COMPAT6           "compat6"
-#define BLOCK_OPT_HWVERSION         "hwversion"
-#define BLOCK_OPT_BACKING_FILE      "backing_file"
-#define BLOCK_OPT_BACKING_FMT       "backing_fmt"
-#define BLOCK_OPT_CLUSTER_SIZE      "cluster_size"
-#define BLOCK_OPT_TABLE_SIZE        "table_size"
-#define BLOCK_OPT_PREALLOC          "preallocation"
-#define BLOCK_OPT_SUBFMT            "subformat"
-#define BLOCK_OPT_COMPAT_LEVEL      "compat"
-#define BLOCK_OPT_LAZY_REFCOUNTS    "lazy_refcounts"
-#define BLOCK_OPT_ADAPTER_TYPE      "adapter_type"
-#define BLOCK_OPT_REDUNDANCY        "redundancy"
-#define BLOCK_OPT_NOCOW             "nocow"
-#define BLOCK_OPT_EXTENT_SIZE_HINT  "extent_size_hint"
-#define BLOCK_OPT_OBJECT_SIZE       "object_size"
-#define BLOCK_OPT_REFCOUNT_BITS     "refcount_bits"
-#define BLOCK_OPT_DATA_FILE         "data_file"
-#define BLOCK_OPT_DATA_FILE_RAW     "data_file_raw"
-#define BLOCK_OPT_COMPRESSION_TYPE  "compression_type"
-#define BLOCK_OPT_EXTL2             "extended_l2"
-
-#define BLOCK_PROBE_BUF_SIZE        512
-
-enum BdrvTrackedRequestType {
-    BDRV_TRACKED_READ,
-    BDRV_TRACKED_WRITE,
-    BDRV_TRACKED_DISCARD,
-    BDRV_TRACKED_TRUNCATE,
-};
-
-/*
- * That is not quite good that BdrvTrackedRequest structure is public,
- * as block/io.c is very careful about incoming offset/bytes being
- * correct. Be sure to assert bdrv_check_request() succeeded after any
- * modification of BdrvTrackedRequest object out of block/io.c
- */
-typedef struct BdrvTrackedRequest {
-    BlockDriverState *bs;
-    int64_t offset;
-    int64_t bytes;
-    enum BdrvTrackedRequestType type;
-
-    bool serialising;
-    int64_t overlap_offset;
-    int64_t overlap_bytes;
-
-    QLIST_ENTRY(BdrvTrackedRequest) list;
-    Coroutine *co; /* owner, used for deadlock detection */
-    CoQueue wait_queue; /* coroutines blocked on this request */
-
-    struct BdrvTrackedRequest *waiting_for;
-} BdrvTrackedRequest;
-
-int bdrv_check_qiov_request(int64_t offset, int64_t bytes,
-                            QEMUIOVector *qiov, size_t qiov_offset,
-                            Error **errp);
-int bdrv_check_request(int64_t offset, int64_t bytes, Error **errp);
-
-struct BlockDriver {
-    const char *format_name;
-    int instance_size;
-
-    /* set to true if the BlockDriver is a block filter. Block filters pass
-     * certain callbacks that refer to data (see block.c) to their bs->file
-     * or bs->backing (whichever one exists) if the driver doesn't implement
-     * them. Drivers that do not wish to forward must implement them and return
-     * -ENOTSUP.
-     * Note that filters are not allowed to modify data.
-     *
-     * Filters generally cannot have more than a single filtered child,
-     * because the data they present must at all times be the same as
-     * that on their filtered child.  That would be impossible to
-     * achieve for multiple filtered children.
-     * (And this filtered child must then be bs->file or bs->backing.)
-     */
-    bool is_filter;
-    /*
-     * Set to true if the BlockDriver is a format driver.  Format nodes
-     * generally do not expect their children to be other format nodes
-     * (except for backing files), and so format probing is disabled
-     * on those children.
-     */
-    bool is_format;
-    /*
-     * Return true if @to_replace can be replaced by a BDS with the
-     * same data as @bs without it affecting @bs's behavior (that is,
-     * without it being visible to @bs's parents).
-     */
-    bool (*bdrv_recurse_can_replace)(BlockDriverState *bs,
-                                     BlockDriverState *to_replace);
-
-    int (*bdrv_probe)(const uint8_t *buf, int buf_size, const char *filename);
-    int (*bdrv_probe_device)(const char *filename);
-
-    /* Any driver implementing this callback is expected to be able to handle
-     * NULL file names in its .bdrv_open() implementation */
-    void (*bdrv_parse_filename)(const char *filename, QDict *options, Error **errp);
-    /* Drivers not implementing bdrv_parse_filename nor bdrv_open should have
-     * this field set to true, except ones that are defined only by their
-     * child's bs.
-     * An example of the last type will be the quorum block driver.
-     */
-    bool bdrv_needs_filename;
-
-    /*
-     * Set if a driver can support backing files. This also implies the
-     * following semantics:
-     *
-     *  - Return status 0 of .bdrv_co_block_status means that corresponding
-     *    blocks are not allocated in this layer of backing-chain
-     *  - For such (unallocated) blocks, read will:
-     *    - fill buffer with zeros if there is no backing file
-     *    - read from the backing file otherwise, where the block layer
-     *      takes care of reading zeros beyond EOF if backing file is short
-     */
-    bool supports_backing;
-
-    /* For handling image reopen for split or non-split files */
-    int (*bdrv_reopen_prepare)(BDRVReopenState *reopen_state,
-                               BlockReopenQueue *queue, Error **errp);
-    void (*bdrv_reopen_commit)(BDRVReopenState *reopen_state);
-    void (*bdrv_reopen_commit_post)(BDRVReopenState *reopen_state);
-    void (*bdrv_reopen_abort)(BDRVReopenState *reopen_state);
-    void (*bdrv_join_options)(QDict *options, QDict *old_options);
-
-    int (*bdrv_open)(BlockDriverState *bs, QDict *options, int flags,
-                     Error **errp);
-
-    /* Protocol drivers should implement this instead of bdrv_open */
-    int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags,
-                          Error **errp);
-    void (*bdrv_close)(BlockDriverState *bs);
-
-
-    int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
-                                       Error **errp);
-    int coroutine_fn (*bdrv_co_create_opts)(BlockDriver *drv,
-                                            const char *filename,
-                                            QemuOpts *opts,
-                                            Error **errp);
-
-    int coroutine_fn (*bdrv_co_amend)(BlockDriverState *bs,
-                                      BlockdevAmendOptions *opts,
-                                      bool force,
-                                      Error **errp);
-
-    int (*bdrv_amend_options)(BlockDriverState *bs,
-                              QemuOpts *opts,
-                              BlockDriverAmendStatusCB *status_cb,
-                              void *cb_opaque,
-                              bool force,
-                              Error **errp);
-
-    int (*bdrv_make_empty)(BlockDriverState *bs);
-
-    /*
-     * Refreshes the bs->exact_filename field. If that is impossible,
-     * bs->exact_filename has to be left empty.
-     */
-    void (*bdrv_refresh_filename)(BlockDriverState *bs);
-
-    /*
-     * Gathers the open options for all children into @target.
-     * A simple format driver (without backing file support) might
-     * implement this function like this:
-     *
-     *     QINCREF(bs->file->bs->full_open_options);
-     *     qdict_put(target, "file", bs->file->bs->full_open_options);
-     *
-     * If not specified, the generic implementation will simply put
-     * all children's options under their respective name.
-     *
-     * @backing_overridden is true when bs->backing seems not to be
-     * the child that would result from opening bs->backing_file.
-     * Therefore, if it is true, the backing child's options should be
-     * gathered; otherwise, there is no need since the backing child
-     * is the one implied by the image header.
-     *
-     * Note that ideally this function would not be needed.  Every
-     * block driver which implements it is probably doing something
-     * shady regarding its runtime option structure.
-     */
-    void (*bdrv_gather_child_options)(BlockDriverState *bs, QDict *target,
-                                      bool backing_overridden);
-
-    /*
-     * Returns an allocated string which is the directory name of this BDS: It
-     * will be used to make relative filenames absolute by prepending this
-     * function's return value to them.
-     */
-    char *(*bdrv_dirname)(BlockDriverState *bs, Error **errp);
-
-    /* aio */
-    BlockAIOCB *(*bdrv_aio_preadv)(BlockDriverState *bs,
-        int64_t offset, int64_t bytes, QEMUIOVector *qiov,
-        BdrvRequestFlags flags, BlockCompletionFunc *cb, void *opaque);
-    BlockAIOCB *(*bdrv_aio_pwritev)(BlockDriverState *bs,
-        int64_t offset, int64_t bytes, QEMUIOVector *qiov,
-        BdrvRequestFlags flags, BlockCompletionFunc *cb, void *opaque);
-    BlockAIOCB *(*bdrv_aio_flush)(BlockDriverState *bs,
-        BlockCompletionFunc *cb, void *opaque);
-    BlockAIOCB *(*bdrv_aio_pdiscard)(BlockDriverState *bs,
-        int64_t offset, int bytes,
-        BlockCompletionFunc *cb, void *opaque);
-
-    int coroutine_fn (*bdrv_co_readv)(BlockDriverState *bs,
-        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
-
-    /**
-     * @offset: position in bytes to read at
-     * @bytes: number of bytes to read
-     * @qiov: the buffers to fill with read data
-     * @flags: currently unused, always 0
-     *
-     * @offset and @bytes will be a multiple of 'request_alignment',
-     * but the length of individual @qiov elements does not have to
-     * be a multiple.
-     *
-     * @bytes will always equal the total size of @qiov, and will be
-     * no larger than 'max_transfer'.
-     *
-     * The buffer in @qiov may point directly to guest memory.
-     */
-    int coroutine_fn (*bdrv_co_preadv)(BlockDriverState *bs,
-        int64_t offset, int64_t bytes, QEMUIOVector *qiov,
-        BdrvRequestFlags flags);
-    int coroutine_fn (*bdrv_co_preadv_part)(BlockDriverState *bs,
-        int64_t offset, int64_t bytes,
-        QEMUIOVector *qiov, size_t qiov_offset, BdrvRequestFlags flags);
-    int coroutine_fn (*bdrv_co_writev)(BlockDriverState *bs,
-        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov, int flags);
-    /**
-     * @offset: position in bytes to write at
-     * @bytes: number of bytes to write
-     * @qiov: the buffers containing data to write
-     * @flags: zero or more bits allowed by 'supported_write_flags'
-     *
-     * @offset and @bytes will be a multiple of 'request_alignment',
-     * but the length of individual @qiov elements does not have to
-     * be a multiple.
-     *
-     * @bytes will always equal the total size of @qiov, and will be
-     * no larger than 'max_transfer'.
-     *
-     * The buffer in @qiov may point directly to guest memory.
-     */
-    int coroutine_fn (*bdrv_co_pwritev)(BlockDriverState *bs,
-        int64_t offset, int64_t bytes, QEMUIOVector *qiov,
-        BdrvRequestFlags flags);
-    int coroutine_fn (*bdrv_co_pwritev_part)(BlockDriverState *bs,
-        int64_t offset, int64_t bytes, QEMUIOVector *qiov, size_t qiov_offset,
-        BdrvRequestFlags flags);
-
-    /*
-     * Efficiently zero a region of the disk image.  Typically an image format
-     * would use a compact metadata representation to implement this.  This
-     * function pointer may be NULL or return -ENOSUP and .bdrv_co_writev()
-     * will be called instead.
-     */
-    int coroutine_fn (*bdrv_co_pwrite_zeroes)(BlockDriverState *bs,
-        int64_t offset, int64_t bytes, BdrvRequestFlags flags);
-    int coroutine_fn (*bdrv_co_pdiscard)(BlockDriverState *bs,
-        int64_t offset, int64_t bytes);
-
-    /* Map [offset, offset + nbytes) range onto a child of @bs to copy from,
-     * and invoke bdrv_co_copy_range_from(child, ...), or invoke
-     * bdrv_co_copy_range_to() if @bs is the leaf child to copy data from.
-     *
-     * See the comment of bdrv_co_copy_range for the parameter and return value
-     * semantics.
-     */
-    int coroutine_fn (*bdrv_co_copy_range_from)(BlockDriverState *bs,
-                                                BdrvChild *src,
-                                                int64_t offset,
-                                                BdrvChild *dst,
-                                                int64_t dst_offset,
-                                                int64_t bytes,
-                                                BdrvRequestFlags read_flags,
-                                                BdrvRequestFlags write_flags);
-
-    /* Map [offset, offset + nbytes) range onto a child of bs to copy data to,
-     * and invoke bdrv_co_copy_range_to(child, src, ...), or perform the copy
-     * operation if @bs is the leaf and @src has the same BlockDriver.  Return
-     * -ENOTSUP if @bs is the leaf but @src has a different BlockDriver.
-     *
-     * See the comment of bdrv_co_copy_range for the parameter and return value
-     * semantics.
-     */
-    int coroutine_fn (*bdrv_co_copy_range_to)(BlockDriverState *bs,
-                                              BdrvChild *src,
-                                              int64_t src_offset,
-                                              BdrvChild *dst,
-                                              int64_t dst_offset,
-                                              int64_t bytes,
-                                              BdrvRequestFlags read_flags,
-                                              BdrvRequestFlags write_flags);
-
-    /*
-     * Building block for bdrv_block_status[_above] and
-     * bdrv_is_allocated[_above].  The driver should answer only
-     * according to the current layer, and should only need to set
-     * BDRV_BLOCK_DATA, BDRV_BLOCK_ZERO, BDRV_BLOCK_OFFSET_VALID,
-     * and/or BDRV_BLOCK_RAW; if the current layer defers to a backing
-     * layer, the result should be 0 (and not BDRV_BLOCK_ZERO).  See
-     * block.h for the overall meaning of the bits.  As a hint, the
-     * flag want_zero is true if the caller cares more about precise
-     * mappings (favor accurate _OFFSET_VALID/_ZERO) or false for
-     * overall allocation (favor larger *pnum, perhaps by reporting
-     * _DATA instead of _ZERO).  The block layer guarantees input
-     * clamped to bdrv_getlength() and aligned to request_alignment,
-     * as well as non-NULL pnum, map, and file; in turn, the driver
-     * must return an error or set pnum to an aligned non-zero value.
-     *
-     * Note that @bytes is just a hint on how big of a region the
-     * caller wants to inspect.  It is not a limit on *pnum.
-     * Implementations are free to return larger values of *pnum if
-     * doing so does not incur a performance penalty.
-     *
-     * block/io.c's bdrv_co_block_status() will utilize an unclamped
-     * *pnum value for the block-status cache on protocol nodes, prior
-     * to clamping *pnum for return to its caller.
-     */
-    int coroutine_fn (*bdrv_co_block_status)(BlockDriverState *bs,
-        bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
-        int64_t *map, BlockDriverState **file);
-
-    /*
-     * This informs the driver that we are no longer interested in the result
-     * of in-flight requests, so don't waste the time if possible.
-     *
-     * One example usage is to avoid waiting for an nbd target node reconnect
-     * timeout during job-cancel with force=true.
-     */
-    void (*bdrv_cancel_in_flight)(BlockDriverState *bs);
-
-    /*
-     * Invalidate any cached meta-data.
-     */
-    void coroutine_fn (*bdrv_co_invalidate_cache)(BlockDriverState *bs,
-                                                  Error **errp);
-    int (*bdrv_inactivate)(BlockDriverState *bs);
-
-    /*
-     * Flushes all data for all layers by calling bdrv_co_flush for underlying
-     * layers, if needed. This function is needed for deterministic
-     * synchronization of the flush finishing callback.
-     */
-    int coroutine_fn (*bdrv_co_flush)(BlockDriverState *bs);
-
-    /* Delete a created file. */
-    int coroutine_fn (*bdrv_co_delete_file)(BlockDriverState *bs,
-                                            Error **errp);
-
-    /*
-     * Flushes all data that was already written to the OS all the way down to
-     * the disk (for example file-posix.c calls fsync()).
-     */
-    int coroutine_fn (*bdrv_co_flush_to_disk)(BlockDriverState *bs);
-
-    /*
-     * Flushes all internal caches to the OS. The data may still sit in a
-     * writeback cache of the host OS, but it will survive a crash of the qemu
-     * process.
-     */
-    int coroutine_fn (*bdrv_co_flush_to_os)(BlockDriverState *bs);
-
-    /*
-     * Drivers setting this field must be able to work with just a plain
-     * filename with '<protocol_name>:' as a prefix, and no other options.
-     * Options may be extracted from the filename by implementing
-     * bdrv_parse_filename.
-     */
-    const char *protocol_name;
-
-    /*
-     * Truncate @bs to @offset bytes using the given @prealloc mode
-     * when growing.  Modes other than PREALLOC_MODE_OFF should be
-     * rejected when shrinking @bs.
-     *
-     * If @exact is true, @bs must be resized to exactly @offset.
-     * Otherwise, it is sufficient for @bs (if it is a host block
-     * device and thus there is no way to resize it) to be at least
-     * @offset bytes in length.
-     *
-     * If @exact is true and this function fails but would succeed
-     * with @exact = false, it should return -ENOTSUP.
-     */
-    int coroutine_fn (*bdrv_co_truncate)(BlockDriverState *bs, int64_t offset,
-                                         bool exact, PreallocMode prealloc,
-                                         BdrvRequestFlags flags, Error **errp);
-
-    int64_t (*bdrv_getlength)(BlockDriverState *bs);
-    bool has_variable_length;
-    int64_t (*bdrv_get_allocated_file_size)(BlockDriverState *bs);
-    BlockMeasureInfo *(*bdrv_measure)(QemuOpts *opts, BlockDriverState *in_bs,
-                                      Error **errp);
-
-    int coroutine_fn (*bdrv_co_pwritev_compressed)(BlockDriverState *bs,
-        int64_t offset, int64_t bytes, QEMUIOVector *qiov);
-    int coroutine_fn (*bdrv_co_pwritev_compressed_part)(BlockDriverState *bs,
-        int64_t offset, int64_t bytes, QEMUIOVector *qiov, size_t qiov_offset);
-
-    int (*bdrv_snapshot_create)(BlockDriverState *bs,
-                                QEMUSnapshotInfo *sn_info);
-    int (*bdrv_snapshot_goto)(BlockDriverState *bs,
-                              const char *snapshot_id);
-    int (*bdrv_snapshot_delete)(BlockDriverState *bs,
-                                const char *snapshot_id,
-                                const char *name,
-                                Error **errp);
-    int (*bdrv_snapshot_list)(BlockDriverState *bs,
-                              QEMUSnapshotInfo **psn_info);
-    int (*bdrv_snapshot_load_tmp)(BlockDriverState *bs,
-                                  const char *snapshot_id,
-                                  const char *name,
-                                  Error **errp);
-    int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
-    ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
-                                                 Error **errp);
-    BlockStatsSpecific *(*bdrv_get_specific_stats)(BlockDriverState *bs);
-
-    int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs,
-                                          QEMUIOVector *qiov,
-                                          int64_t pos);
-    int coroutine_fn (*bdrv_load_vmstate)(BlockDriverState *bs,
-                                          QEMUIOVector *qiov,
-                                          int64_t pos);
-
-    int (*bdrv_change_backing_file)(BlockDriverState *bs,
-        const char *backing_file, const char *backing_fmt);
-
-    /* removable device specific */
-    bool (*bdrv_is_inserted)(BlockDriverState *bs);
-    void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
-    void (*bdrv_lock_medium)(BlockDriverState *bs, bool locked);
-
-    /* to control generic scsi devices */
-    BlockAIOCB *(*bdrv_aio_ioctl)(BlockDriverState *bs,
-        unsigned long int req, void *buf,
-        BlockCompletionFunc *cb, void *opaque);
-    int coroutine_fn (*bdrv_co_ioctl)(BlockDriverState *bs,
-                                      unsigned long int req, void *buf);
-
-    /* List of options for creating images, terminated by name == NULL */
-    QemuOptsList *create_opts;
-
-    /* List of options for image amend */
-    QemuOptsList *amend_opts;
-
-    /*
-     * If this driver supports reopening images this contains a
-     * NULL-terminated list of the runtime options that can be
-     * modified. If an option in this list is unspecified during
-     * reopen then it _must_ be reset to its default value or return
-     * an error.
-     */
-    const char *const *mutable_opts;
-
-    /*
-     * Returns 0 for completed check, -errno for internal errors.
-     * The check results are stored in result.
-     */
-    int coroutine_fn (*bdrv_co_check)(BlockDriverState *bs,
-                                      BdrvCheckResult *result,
-                                      BdrvCheckMode fix);
-
-    void (*bdrv_debug_event)(BlockDriverState *bs, BlkdebugEvent event);
-
-    /* TODO Better pass a option string/QDict/QemuOpts to add any rule? */
-    int (*bdrv_debug_breakpoint)(BlockDriverState *bs, const char *event,
-        const char *tag);
-    int (*bdrv_debug_remove_breakpoint)(BlockDriverState *bs,
-        const char *tag);
-    int (*bdrv_debug_resume)(BlockDriverState *bs, const char *tag);
-    bool (*bdrv_debug_is_suspended)(BlockDriverState *bs, const char *tag);
-
-    void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);
-
-    /*
-     * Returns 1 if newly created images are guaranteed to contain only
-     * zeros, 0 otherwise.
-     */
-    int (*bdrv_has_zero_init)(BlockDriverState *bs);
-
-    /* Remove fd handlers, timers, and other event loop callbacks so the event
-     * loop is no longer in use.  Called with no in-flight requests and in
-     * depth-first traversal order with parents before child nodes.
-     */
-    void (*bdrv_detach_aio_context)(BlockDriverState *bs);
-
-    /* Add fd handlers, timers, and other event loop callbacks so I/O requests
-     * can be processed again.  Called with no in-flight requests and in
-     * depth-first traversal order with child nodes before parent nodes.
-     */
-    void (*bdrv_attach_aio_context)(BlockDriverState *bs,
-                                    AioContext *new_context);
-
-    /* io queue for linux-aio */
-    void (*bdrv_io_plug)(BlockDriverState *bs);
-    void (*bdrv_io_unplug)(BlockDriverState *bs);
-
-    /**
-     * Try to get @bs's logical and physical block size.
-     * On success, store them in @bsz and return zero.
-     * On failure, return negative errno.
-     */
-    int (*bdrv_probe_blocksizes)(BlockDriverState *bs, BlockSizes *bsz);
-    /**
-     * Try to get @bs's geometry (cyls, heads, sectors)
-     * On success, store them in @geo and return 0.
-     * On failure return -errno.
-     * Only drivers that want to override guest geometry implement this
-     * callback; see hd_geometry_guess().
-     */
-    int (*bdrv_probe_geometry)(BlockDriverState *bs, HDGeometry *geo);
-
-    /**
-     * bdrv_co_drain_begin is called if implemented in the beginning of a
-     * drain operation to drain and stop any internal sources of requests in
-     * the driver.
-     * bdrv_co_drain_end is called if implemented at the end of the drain.
-     *
-     * They should be used by the driver to e.g. manage scheduled I/O
-     * requests, or toggle an internal state. After the end of the drain new
-     * requests will continue normally.
-     */
-    void coroutine_fn (*bdrv_co_drain_begin)(BlockDriverState *bs);
-    void coroutine_fn (*bdrv_co_drain_end)(BlockDriverState *bs);
-
-    void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child,
-                           Error **errp);
-    void (*bdrv_del_child)(BlockDriverState *parent, BdrvChild *child,
-                           Error **errp);
-
-    /**
-     * Informs the block driver that a permission change is intended. The
-     * driver checks whether the change is permissible and may take other
-     * preparations for the change (e.g. get file system locks). This operation
-     * is always followed either by a call to either .bdrv_set_perm or
-     * .bdrv_abort_perm_update.
-     *
-     * Checks whether the requested set of cumulative permissions in @perm
-     * can be granted for accessing @bs and whether no other users are using
-     * permissions other than those given in @shared (both arguments take
-     * BLK_PERM_* bitmasks).
-     *
-     * If both conditions are met, 0 is returned. Otherwise, -errno is returned
-     * and errp is set to an error describing the conflict.
-     */
-    int (*bdrv_check_perm)(BlockDriverState *bs, uint64_t perm,
-                           uint64_t shared, Error **errp);
-
-    /**
-     * Called to inform the driver that the set of cumulative set of used
-     * permissions for @bs has changed to @perm, and the set of sharable
-     * permission to @shared. The driver can use this to propagate changes to
-     * its children (i.e. request permissions only if a parent actually needs
-     * them).
-     *
-     * This function is only invoked after bdrv_check_perm(), so block drivers
-     * may rely on preparations made in their .bdrv_check_perm implementation.
-     */
-    void (*bdrv_set_perm)(BlockDriverState *bs, uint64_t perm, uint64_t shared);
-
-    /*
-     * Called to inform the driver that after a previous bdrv_check_perm()
-     * call, the permission update is not performed and any preparations made
-     * for it (e.g. taken file locks) need to be undone.
-     *
-     * This function can be called even for nodes that never saw a
-     * bdrv_check_perm() call. It is a no-op then.
-     */
-    void (*bdrv_abort_perm_update)(BlockDriverState *bs);
-
-    /**
-     * Returns in @nperm and @nshared the permissions that the driver for @bs
-     * needs on its child @c, based on the cumulative permissions requested by
-     * the parents in @parent_perm and @parent_shared.
-     *
-     * If @c is NULL, return the permissions for attaching a new child for the
-     * given @child_class and @role.
-     *
-     * If @reopen_queue is non-NULL, don't return the currently needed
-     * permissions, but those that will be needed after applying the
-     * @reopen_queue.
-     */
-     void (*bdrv_child_perm)(BlockDriverState *bs, BdrvChild *c,
-                             BdrvChildRole role,
-                             BlockReopenQueue *reopen_queue,
-                             uint64_t parent_perm, uint64_t parent_shared,
-                             uint64_t *nperm, uint64_t *nshared);
-
-    bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs);
-    bool (*bdrv_co_can_store_new_dirty_bitmap)(BlockDriverState *bs,
-                                               const char *name,
-                                               uint32_t granularity,
-                                               Error **errp);
-    int (*bdrv_co_remove_persistent_dirty_bitmap)(BlockDriverState *bs,
-                                                  const char *name,
-                                                  Error **errp);
-
-    /**
-     * Register/unregister a buffer for I/O. For example, when the driver is
-     * interested to know the memory areas that will later be used in iovs, so
-     * that it can do IOMMU mapping with VFIO etc., in order to get better
-     * performance. In the case of VFIO drivers, this callback is used to do
-     * DMA mapping for hot buffers.
-     */
-    void (*bdrv_register_buf)(BlockDriverState *bs, void *host, size_t size);
-    void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host);
-    QLIST_ENTRY(BlockDriver) list;
-
-    /* Pointer to a NULL-terminated array of names of strong options
-     * that can be specified for bdrv_open(). A strong option is one
-     * that changes the data of a BDS.
-     * If this pointer is NULL, the array is considered empty.
-     * "filename" and "driver" are always considered strong. */
-    const char *const *strong_runtime_opts;
-};
-
-static inline bool block_driver_can_compress(BlockDriver *drv)
-{
-    return drv->bdrv_co_pwritev_compressed ||
-           drv->bdrv_co_pwritev_compressed_part;
-}
-
-typedef struct BlockLimits {
-    /* Alignment requirement, in bytes, for offset/length of I/O
-     * requests. Must be a power of 2 less than INT_MAX; defaults to
-     * 1 for drivers with modern byte interfaces, and to 512
-     * otherwise. */
-    uint32_t request_alignment;
-
-    /*
-     * Maximum number of bytes that can be discarded at once. Must be multiple
-     * of pdiscard_alignment, but need not be power of 2. May be 0 if no
-     * inherent 64-bit limit.
-     */
-    int64_t max_pdiscard;
-
-    /* Optimal alignment for discard requests in bytes. A power of 2
-     * is best but not mandatory.  Must be a multiple of
-     * bl.request_alignment, and must be less than max_pdiscard if
-     * that is set. May be 0 if bl.request_alignment is good enough */
-    uint32_t pdiscard_alignment;
-
-    /*
-     * Maximum number of bytes that can zeroized at once. Must be multiple of
-     * pwrite_zeroes_alignment. 0 means no limit.
-     */
-    int64_t max_pwrite_zeroes;
-
-    /* Optimal alignment for write zeroes requests in bytes. A power
-     * of 2 is best but not mandatory.  Must be a multiple of
-     * bl.request_alignment, and must be less than max_pwrite_zeroes
-     * if that is set. May be 0 if bl.request_alignment is good
-     * enough */
-    uint32_t pwrite_zeroes_alignment;
-
-    /* Optimal transfer length in bytes.  A power of 2 is best but not
-     * mandatory.  Must be a multiple of bl.request_alignment, or 0 if
-     * no preferred size */
-    uint32_t opt_transfer;
-
-    /* Maximal transfer length in bytes.  Need not be power of 2, but
-     * must be multiple of opt_transfer and bl.request_alignment, or 0
-     * for no 32-bit limit.  For now, anything larger than INT_MAX is
-     * clamped down. */
-    uint32_t max_transfer;
-
-    /* Maximal hardware transfer length in bytes.  Applies whenever
-     * transfers to the device bypass the kernel I/O scheduler, for
-     * example with SG_IO.  If larger than max_transfer or if zero,
-     * blk_get_max_hw_transfer will fall back to max_transfer.
-     */
-    uint64_t max_hw_transfer;
-
-    /* Maximal number of scatter/gather elements allowed by the hardware.
-     * Applies whenever transfers to the device bypass the kernel I/O
-     * scheduler, for example with SG_IO.  If larger than max_iov
-     * or if zero, blk_get_max_hw_iov will fall back to max_iov.
-     */
-    int max_hw_iov;
-
-    /* memory alignment, in bytes so that no bounce buffer is needed */
-    size_t min_mem_alignment;
-
-    /* memory alignment, in bytes, for bounce buffer */
-    size_t opt_mem_alignment;
-
-    /* maximum number of iovec elements */
-    int max_iov;
-} BlockLimits;
-
-typedef struct BdrvOpBlocker BdrvOpBlocker;
-
-typedef struct BdrvAioNotifier {
-    void (*attached_aio_context)(AioContext *new_context, void *opaque);
-    void (*detach_aio_context)(void *opaque);
-
-    void *opaque;
-    bool deleted;
-
-    QLIST_ENTRY(BdrvAioNotifier) list;
-} BdrvAioNotifier;
-
-struct BdrvChildClass {
-    /* If true, bdrv_replace_node() doesn't change the node this BdrvChild
-     * points to. */
-    bool stay_at_node;
-
-    /* If true, the parent is a BlockDriverState and bdrv_next_all_states()
-     * will return it. This information is used for drain_all, where every node
-     * will be drained separately, so the drain only needs to be propagated to
-     * non-BDS parents. */
-    bool parent_is_bds;
-
-    void (*inherit_options)(BdrvChildRole role, bool parent_is_format,
-                            int *child_flags, QDict *child_options,
-                            int parent_flags, QDict *parent_options);
-
-    void (*change_media)(BdrvChild *child, bool load);
-    void (*resize)(BdrvChild *child);
-
-    /* Returns a name that is supposedly more useful for human users than the
-     * node name for identifying the node in question (in particular, a BB
-     * name), or NULL if the parent can't provide a better name. */
-    const char *(*get_name)(BdrvChild *child);
-
-    /* Returns a malloced string that describes the parent of the child for a
-     * human reader. This could be a node-name, BlockBackend name, qdev ID or
-     * QOM path of the device owning the BlockBackend, job type and ID etc. The
-     * caller is responsible for freeing the memory. */
-    char *(*get_parent_desc)(BdrvChild *child);
-
-    /*
-     * If this pair of functions is implemented, the parent doesn't issue new
-     * requests after returning from .drained_begin() until .drained_end() is
-     * called.
-     *
-     * These functions must not change the graph (and therefore also must not
-     * call aio_poll(), which could change the graph indirectly).
-     *
-     * If drained_end() schedules background operations, it must atomically
-     * increment *drained_end_counter for each such operation and atomically
-     * decrement it once the operation has settled.
-     *
-     * Note that this can be nested. If drained_begin() was called twice, new
-     * I/O is allowed only after drained_end() was called twice, too.
-     */
-    void (*drained_begin)(BdrvChild *child);
-    void (*drained_end)(BdrvChild *child, int *drained_end_counter);
-
-    /*
-     * Returns whether the parent has pending requests for the child. This
-     * callback is polled after .drained_begin() has been called until all
-     * activity on the child has stopped.
-     */
-    bool (*drained_poll)(BdrvChild *child);
-
-    /* Notifies the parent that the child has been activated/inactivated (e.g.
-     * when migration is completing) and it can start/stop requesting
-     * permissions and doing I/O on it. */
-    void (*activate)(BdrvChild *child, Error **errp);
-    int (*inactivate)(BdrvChild *child);
-
-    void (*attach)(BdrvChild *child);
-    void (*detach)(BdrvChild *child);
-
-    /* Notifies the parent that the filename of its child has changed (e.g.
-     * because the direct child was removed from the backing chain), so that it
-     * can update its reference. */
-    int (*update_filename)(BdrvChild *child, BlockDriverState *new_base,
-                           const char *filename, Error **errp);
-
-    bool (*can_set_aio_ctx)(BdrvChild *child, AioContext *ctx,
-                            GSList **ignore, Error **errp);
-    void (*set_aio_ctx)(BdrvChild *child, AioContext *ctx, GSList **ignore);
-
-    AioContext *(*get_parent_aio_context)(BdrvChild *child);
-};
-
-extern const BdrvChildClass child_of_bds;
-
-struct BdrvChild {
-    BlockDriverState *bs;
-    char *name;
-    const BdrvChildClass *klass;
-    BdrvChildRole role;
-    void *opaque;
-
-    /**
-     * Granted permissions for operating on this BdrvChild (BLK_PERM_* bitmask)
-     */
-    uint64_t perm;
-
-    /**
-     * Permissions that can still be granted to other users of @bs while this
-     * BdrvChild is still attached to it. (BLK_PERM_* bitmask)
-     */
-    uint64_t shared_perm;
-
-    /*
-     * This link is frozen: the child can neither be replaced nor
-     * detached from the parent.
-     */
-    bool frozen;
-
-    /*
-     * How many times the parent of this child has been drained
-     * (through klass->drained_*).
-     * Usually, this is equal to bs->quiesce_counter (potentially
-     * reduced by bdrv_drain_all_count).  It may differ while the
-     * child is entering or leaving a drained section.
-     */
-    int parent_quiesce_counter;
-
-    QLIST_ENTRY(BdrvChild) next;
-    QLIST_ENTRY(BdrvChild) next_parent;
-};
-
-/*
- * Allows bdrv_co_block_status() to cache one data region for a
- * protocol node.
- *
- * @valid: Whether the cache is valid (should be accessed with atomic
- *         functions so this can be reset by RCU readers)
- * @data_start: Offset where we know (or strongly assume) is data
- * @data_end: Offset where the data region ends (which is not necessarily
- *            the start of a zeroed region)
- */
-typedef struct BdrvBlockStatusCache {
-    struct rcu_head rcu;
-
-    bool valid;
-    int64_t data_start;
-    int64_t data_end;
-} BdrvBlockStatusCache;
-
-struct BlockDriverState {
-    /* Protected by big QEMU lock or read-only after opening.  No special
-     * locking needed during I/O...
-     */
-    int open_flags; /* flags used to open the file, re-used for re-open */
-    bool encrypted; /* if true, the media is encrypted */
-    bool sg;        /* if true, the device is a /dev/sg* */
-    bool probed;    /* if true, format was probed rather than specified */
-    bool force_share; /* if true, always allow all shared permissions */
-    bool implicit;  /* if true, this filter node was automatically inserted */
-
-    BlockDriver *drv; /* NULL means no media */
-    void *opaque;
-
-    AioContext *aio_context; /* event loop used for fd handlers, timers, etc */
-    /* long-running tasks intended to always use the same AioContext as this
-     * BDS may register themselves in this list to be notified of changes
-     * regarding this BDS's context */
-    QLIST_HEAD(, BdrvAioNotifier) aio_notifiers;
-    bool walking_aio_notifiers; /* to make removal during iteration safe */
-
-    char filename[PATH_MAX];
-    /*
-     * If not empty, this image is a diff in relation to backing_file.
-     * Note that this is the name given in the image header and
-     * therefore may or may not be equal to .backing->bs->filename.
-     * If this field contains a relative path, it is to be resolved
-     * relatively to the overlay's location.
-     */
-    char backing_file[PATH_MAX];
-    /*
-     * The backing filename indicated by the image header.  Contrary
-     * to backing_file, if we ever open this file, auto_backing_file
-     * is replaced by the resulting BDS's filename (i.e. after a
-     * bdrv_refresh_filename() run).
-     */
-    char auto_backing_file[PATH_MAX];
-    char backing_format[16]; /* if non-zero and backing_file exists */
-
-    QDict *full_open_options;
-    char exact_filename[PATH_MAX];
-
-    BdrvChild *backing;
-    BdrvChild *file;
-
-    /* I/O Limits */
-    BlockLimits bl;
-
-    /*
-     * Flags honored during pread
-     */
-    unsigned int supported_read_flags;
-    /* Flags honored during pwrite (so far: BDRV_REQ_FUA,
-     * BDRV_REQ_WRITE_UNCHANGED).
-     * If a driver does not support BDRV_REQ_WRITE_UNCHANGED, those
-     * writes will be issued as normal writes without the flag set.
-     * This is important to note for drivers that do not explicitly
-     * request a WRITE permission for their children and instead take
-     * the same permissions as their parent did (this is commonly what
-     * block filters do).  Such drivers have to be aware that the
-     * parent may have taken a WRITE_UNCHANGED permission only and is
-     * issuing such requests.  Drivers either must make sure that
-     * these requests do not result in plain WRITE accesses (usually
-     * by supporting BDRV_REQ_WRITE_UNCHANGED, and then forwarding
-     * every incoming write request as-is, including potentially that
-     * flag), or they have to explicitly take the WRITE permission for
-     * their children. */
-    unsigned int supported_write_flags;
-    /* Flags honored during pwrite_zeroes (so far: BDRV_REQ_FUA,
-     * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED) */
-    unsigned int supported_zero_flags;
-    /*
-     * Flags honoured during truncate (so far: BDRV_REQ_ZERO_WRITE).
-     *
-     * If BDRV_REQ_ZERO_WRITE is given, the truncate operation must make sure
-     * that any added space reads as all zeros. If this can't be guaranteed,
-     * the operation must fail.
-     */
-    unsigned int supported_truncate_flags;
-
-    /* the following member gives a name to every node on the bs graph. */
-    char node_name[32];
-    /* element of the list of named nodes building the graph */
-    QTAILQ_ENTRY(BlockDriverState) node_list;
-    /* element of the list of all BlockDriverStates (all_bdrv_states) */
-    QTAILQ_ENTRY(BlockDriverState) bs_list;
-    /* element of the list of monitor-owned BDS */
-    QTAILQ_ENTRY(BlockDriverState) monitor_list;
-    int refcnt;
-
-    /* operation blockers */
-    QLIST_HEAD(, BdrvOpBlocker) op_blockers[BLOCK_OP_TYPE_MAX];
-
-    /* The node that this node inherited default options from (and a reopen on
-     * which can affect this node by changing these defaults). This is always a
-     * parent node of this node. */
-    BlockDriverState *inherits_from;
-    QLIST_HEAD(, BdrvChild) children;
-    QLIST_HEAD(, BdrvChild) parents;
-
-    QDict *options;
-    QDict *explicit_options;
-    BlockdevDetectZeroesOptions detect_zeroes;
-
-    /* The error object in use for blocking operations on backing_hd */
-    Error *backing_blocker;
-
-    /* Protected by AioContext lock */
-
-    /* If we are reading a disk image, give its size in sectors.
-     * Generally read-only; it is written to by load_snapshot and
-     * save_snaphost, but the block layer is quiescent during those.
-     */
-    int64_t total_sectors;
-
-    /* threshold limit for writes, in bytes. "High water mark". */
-    uint64_t write_threshold_offset;
-
-    /* Writing to the list requires the BQL _and_ the dirty_bitmap_mutex.
-     * Reading from the list can be done with either the BQL or the
-     * dirty_bitmap_mutex.  Modifying a bitmap only requires
-     * dirty_bitmap_mutex.  */
-    QemuMutex dirty_bitmap_mutex;
-    QLIST_HEAD(, BdrvDirtyBitmap) dirty_bitmaps;
-
-    /* Offset after the highest byte written to */
-    Stat64 wr_highest_offset;
-
-    /* If true, copy read backing sectors into image.  Can be >1 if more
-     * than one client has requested copy-on-read.  Accessed with atomic
-     * ops.
-     */
-    int copy_on_read;
-
-    /* number of in-flight requests; overall and serialising.
-     * Accessed with atomic ops.
-     */
-    unsigned int in_flight;
-    unsigned int serialising_in_flight;
-
-    /* counter for nested bdrv_io_plug.
-     * Accessed with atomic ops.
-    */
-    unsigned io_plugged;
-
-    /* do we need to tell the quest if we have a volatile write cache? */
-    int enable_write_cache;
-
-    /* Accessed with atomic ops.  */
-    int quiesce_counter;
-    int recursive_quiesce_counter;
-
-    unsigned int write_gen;               /* Current data generation */
-
-    /* Protected by reqs_lock.  */
-    CoMutex reqs_lock;
-    QLIST_HEAD(, BdrvTrackedRequest) tracked_requests;
-    CoQueue flush_queue;                  /* Serializing flush queue */
-    bool active_flush_req;                /* Flush request in flight? */
-
-    /* Only read/written by whoever has set active_flush_req to true.  */
-    unsigned int flushed_gen;             /* Flushed write generation */
-
-    /* BdrvChild links to this node may never be frozen */
-    bool never_freeze;
-
-    /* Lock for block-status cache RCU writers */
-    CoMutex bsc_modify_lock;
-    /* Always non-NULL, but must only be dereferenced under an RCU read guard */
-    BdrvBlockStatusCache *block_status_cache;
-};
-
-struct BlockBackendRootState {
-    int open_flags;
-    BlockdevDetectZeroesOptions detect_zeroes;
-};
-
-typedef enum BlockMirrorBackingMode {
-    /* Reuse the existing backing chain from the source for the target.
-     * - sync=full: Set backing BDS to NULL.
-     * - sync=top:  Use source's backing BDS.
-     * - sync=none: Use source as the backing BDS. */
-    MIRROR_SOURCE_BACKING_CHAIN,
-
-    /* Open the target's backing chain completely anew */
-    MIRROR_OPEN_BACKING_CHAIN,
-
-    /* Do not change the target's backing BDS after job completion */
-    MIRROR_LEAVE_BACKING_CHAIN,
-} BlockMirrorBackingMode;
-
-
-/* Essential block drivers which must always be statically linked into qemu, and
- * which therefore can be accessed without using bdrv_find_format() */
-extern BlockDriver bdrv_file;
-extern BlockDriver bdrv_raw;
-extern BlockDriver bdrv_qcow2;
-
-int coroutine_fn bdrv_co_preadv(BdrvChild *child,
-    int64_t offset, int64_t bytes, QEMUIOVector *qiov,
-    BdrvRequestFlags flags);
-int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
-    int64_t offset, int64_t bytes,
-    QEMUIOVector *qiov, size_t qiov_offset, BdrvRequestFlags flags);
-int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
-    int64_t offset, int64_t bytes, QEMUIOVector *qiov,
-    BdrvRequestFlags flags);
-int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
-    int64_t offset, int64_t bytes,
-    QEMUIOVector *qiov, size_t qiov_offset, BdrvRequestFlags flags);
-
-static inline int coroutine_fn bdrv_co_pread(BdrvChild *child,
-    int64_t offset, unsigned int bytes, void *buf, BdrvRequestFlags flags)
-{
-    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
-
-    return bdrv_co_preadv(child, offset, bytes, &qiov, flags);
-}
-
-static inline int coroutine_fn bdrv_co_pwrite(BdrvChild *child,
-    int64_t offset, unsigned int bytes, void *buf, BdrvRequestFlags flags)
-{
-    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
-
-    return bdrv_co_pwritev(child, offset, bytes, &qiov, flags);
-}
-
-extern unsigned int bdrv_drain_all_count;
-void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent);
-void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent);
-
-bool coroutine_fn bdrv_make_request_serialising(BdrvTrackedRequest *req,
-                                                uint64_t align);
-BdrvTrackedRequest *coroutine_fn bdrv_co_get_self_request(BlockDriverState *bs);
-
-int get_tmp_filename(char *filename, int size);
-BlockDriver *bdrv_probe_all(const uint8_t *buf, int buf_size,
-                            const char *filename);
-
-void bdrv_parse_filename_strip_prefix(const char *filename, const char *prefix,
-                                      QDict *options);
-
-bool bdrv_backing_overridden(BlockDriverState *bs);
-
-
-/**
- * bdrv_add_aio_context_notifier:
- *
- * If a long-running job intends to be always run in the same AioContext as a
- * certain BDS, it may use this function to be notified of changes regarding the
- * association of the BDS to an AioContext.
- *
- * attached_aio_context() is called after the target BDS has been attached to a
- * new AioContext; detach_aio_context() is called before the target BDS is being
- * detached from its old AioContext.
- */
-void bdrv_add_aio_context_notifier(BlockDriverState *bs,
-        void (*attached_aio_context)(AioContext *new_context, void *opaque),
-        void (*detach_aio_context)(void *opaque), void *opaque);
-
-/**
- * bdrv_remove_aio_context_notifier:
- *
- * Unsubscribe of change notifications regarding the BDS's AioContext. The
- * parameters given here have to be the same as those given to
- * bdrv_add_aio_context_notifier().
- */
-void bdrv_remove_aio_context_notifier(BlockDriverState *bs,
-                                      void (*aio_context_attached)(AioContext *,
-                                                                   void *),
-                                      void (*aio_context_detached)(void *),
-                                      void *opaque);
-
-/**
- * bdrv_wakeup:
- * @bs: The BlockDriverState for which an I/O operation has been completed.
- *
- * Wake up the main thread if it is waiting on BDRV_POLL_WHILE.  During
- * synchronous I/O on a BlockDriverState that is attached to another
- * I/O thread, the main thread lets the I/O thread's event loop run,
- * waiting for the I/O operation to complete.  A bdrv_wakeup will wake
- * up the main thread if necessary.
- *
- * Manual calls to bdrv_wakeup are rarely necessary, because
- * bdrv_dec_in_flight already calls it.
- */
-void bdrv_wakeup(BlockDriverState *bs);
-
-#ifdef _WIN32
-int is_windows_drive(const char *filename);
-#endif
-
-/**
- * stream_start:
- * @job_id: The id of the newly-created job, or %NULL to use the
- * device name of @bs.
- * @bs: Block device to operate on.
- * @base: Block device that will become the new base, or %NULL to
- * flatten the whole backing file chain onto @bs.
- * @backing_file_str: The file name that will be written to @bs as the
- * the new backing file if the job completes. Ignored if @base is %NULL.
- * @creation_flags: Flags that control the behavior of the Job lifetime.
- *                  See @BlockJobCreateFlags
- * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
- * @on_error: The action to take upon error.
- * @filter_node_name: The node name that should be assigned to the filter
- *                    driver that the stream job inserts into the graph above
- *                    @bs. NULL means that a node name should be autogenerated.
- * @errp: Error object.
- *
- * Start a streaming operation on @bs.  Clusters that are unallocated
- * in @bs, but allocated in any image between @base and @bs (both
- * exclusive) will be written to @bs.  At the end of a successful
- * streaming job, the backing file of @bs will be changed to
- * @backing_file_str in the written image and to @base in the live
- * BlockDriverState.
- */
-void stream_start(const char *job_id, BlockDriverState *bs,
-                  BlockDriverState *base, const char *backing_file_str,
-                  BlockDriverState *bottom,
-                  int creation_flags, int64_t speed,
-                  BlockdevOnError on_error,
-                  const char *filter_node_name,
-                  Error **errp);
-
-/**
- * commit_start:
- * @job_id: The id of the newly-created job, or %NULL to use the
- * device name of @bs.
- * @bs: Active block device.
- * @top: Top block device to be committed.
- * @base: Block device that will be written into, and become the new top.
- * @creation_flags: Flags that control the behavior of the Job lifetime.
- *                  See @BlockJobCreateFlags
- * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
- * @on_error: The action to take upon error.
- * @backing_file_str: String to use as the backing file in @top's overlay
- * @filter_node_name: The node name that should be assigned to the filter
- * driver that the commit job inserts into the graph above @top. NULL means
- * that a node name should be autogenerated.
- * @errp: Error object.
- *
- */
-void commit_start(const char *job_id, BlockDriverState *bs,
-                  BlockDriverState *base, BlockDriverState *top,
-                  int creation_flags, int64_t speed,
-                  BlockdevOnError on_error, const char *backing_file_str,
-                  const char *filter_node_name, Error **errp);
-/**
- * commit_active_start:
- * @job_id: The id of the newly-created job, or %NULL to use the
- * device name of @bs.
- * @bs: Active block device to be committed.
- * @base: Block device that will be written into, and become the new top.
- * @creation_flags: Flags that control the behavior of the Job lifetime.
- *                  See @BlockJobCreateFlags
- * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
- * @on_error: The action to take upon error.
- * @filter_node_name: The node name that should be assigned to the filter
- * driver that the commit job inserts into the graph above @bs. NULL means that
- * a node name should be autogenerated.
- * @cb: Completion function for the job.
- * @opaque: Opaque pointer value passed to @cb.
- * @auto_complete: Auto complete the job.
- * @errp: Error object.
- *
- */
-BlockJob *commit_active_start(const char *job_id, BlockDriverState *bs,
-                              BlockDriverState *base, int creation_flags,
-                              int64_t speed, BlockdevOnError on_error,
-                              const char *filter_node_name,
-                              BlockCompletionFunc *cb, void *opaque,
-                              bool auto_complete, Error **errp);
-/*
- * mirror_start:
- * @job_id: The id of the newly-created job, or %NULL to use the
- * device name of @bs.
- * @bs: Block device to operate on.
- * @target: Block device to write to.
- * @replaces: Block graph node name to replace once the mirror is done. Can
- *            only be used when full mirroring is selected.
- * @creation_flags: Flags that control the behavior of the Job lifetime.
- *                  See @BlockJobCreateFlags
- * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
- * @granularity: The chosen granularity for the dirty bitmap.
- * @buf_size: The amount of data that can be in flight at one time.
- * @mode: Whether to collapse all images in the chain to the target.
- * @backing_mode: How to establish the target's backing chain after completion.
- * @zero_target: Whether the target should be explicitly zero-initialized
- * @on_source_error: The action to take upon error reading from the source.
- * @on_target_error: The action to take upon error writing to the target.
- * @unmap: Whether to unmap target where source sectors only contain zeroes.
- * @filter_node_name: The node name that should be assigned to the filter
- * driver that the mirror job inserts into the graph above @bs. NULL means that
- * a node name should be autogenerated.
- * @copy_mode: When to trigger writes to the target.
- * @errp: Error object.
- *
- * Start a mirroring operation on @bs.  Clusters that are allocated
- * in @bs will be written to @target until the job is cancelled or
- * manually completed.  At the end of a successful mirroring job,
- * @bs will be switched to read from @target.
- */
-void mirror_start(const char *job_id, BlockDriverState *bs,
-                  BlockDriverState *target, const char *replaces,
-                  int creation_flags, int64_t speed,
-                  uint32_t granularity, int64_t buf_size,
-                  MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
-                  bool zero_target,
-                  BlockdevOnError on_source_error,
-                  BlockdevOnError on_target_error,
-                  bool unmap, const char *filter_node_name,
-                  MirrorCopyMode copy_mode, Error **errp);
-
-/*
- * backup_job_create:
- * @job_id: The id of the newly-created job, or %NULL to use the
- * device name of @bs.
- * @bs: Block device to operate on.
- * @target: Block device to write to.
- * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
- * @sync_mode: What parts of the disk image should be copied to the destination.
- * @sync_bitmap: The dirty bitmap if sync_mode is 'bitmap' or 'incremental'
- * @bitmap_mode: The bitmap synchronization policy to use.
- * @perf: Performance options. All actual fields assumed to be present,
- *        all ".has_*" fields are ignored.
- * @on_source_error: The action to take upon error reading from the source.
- * @on_target_error: The action to take upon error writing to the target.
- * @creation_flags: Flags that control the behavior of the Job lifetime.
- *                  See @BlockJobCreateFlags
- * @cb: Completion function for the job.
- * @opaque: Opaque pointer value passed to @cb.
- * @txn: Transaction that this job is part of (may be NULL).
- *
- * Create a backup operation on @bs.  Clusters in @bs are written to @target
- * until the job is cancelled or manually completed.
- */
-BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
-                            BlockDriverState *target, int64_t speed,
-                            MirrorSyncMode sync_mode,
-                            BdrvDirtyBitmap *sync_bitmap,
-                            BitmapSyncMode bitmap_mode,
-                            bool compress,
-                            const char *filter_node_name,
-                            BackupPerf *perf,
-                            BlockdevOnError on_source_error,
-                            BlockdevOnError on_target_error,
-                            int creation_flags,
-                            BlockCompletionFunc *cb, void *opaque,
-                            JobTxn *txn, Error **errp);
-
-BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
-                                  const char *child_name,
-                                  const BdrvChildClass *child_class,
-                                  BdrvChildRole child_role,
-                                  uint64_t perm, uint64_t shared_perm,
-                                  void *opaque, Error **errp);
-void bdrv_root_unref_child(BdrvChild *child);
-
-void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
-                              uint64_t *shared_perm);
-
-/**
- * Sets a BdrvChild's permissions.  Avoid if the parent is a BDS; use
- * bdrv_child_refresh_perms() instead and make the parent's
- * .bdrv_child_perm() implementation return the correct values.
- */
-int bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared,
-                            Error **errp);
-
-/**
- * Calls bs->drv->bdrv_child_perm() and updates the child's permission
- * masks with the result.
- * Drivers should invoke this function whenever an event occurs that
- * makes their .bdrv_child_perm() implementation return different
- * values than before, but which will not result in the block layer
- * automatically refreshing the permissions.
- */
-int bdrv_child_refresh_perms(BlockDriverState *bs, BdrvChild *c, Error **errp);
-
-bool bdrv_recurse_can_replace(BlockDriverState *bs,
-                              BlockDriverState *to_replace);
-
-/*
- * Default implementation for BlockDriver.bdrv_child_perm() that can
- * be used by block filters and image formats, as long as they use the
- * child_of_bds child class and set an appropriate BdrvChildRole.
- */
-void bdrv_default_perms(BlockDriverState *bs, BdrvChild *c,
-                        BdrvChildRole role, BlockReopenQueue *reopen_queue,
-                        uint64_t perm, uint64_t shared,
-                        uint64_t *nperm, uint64_t *nshared);
-
-const char *bdrv_get_parent_name(const BlockDriverState *bs);
-void blk_dev_change_media_cb(BlockBackend *blk, bool load, Error **errp);
-bool blk_dev_has_removable_media(BlockBackend *blk);
-bool blk_dev_has_tray(BlockBackend *blk);
-void blk_dev_eject_request(BlockBackend *blk, bool force);
-bool blk_dev_is_tray_open(BlockBackend *blk);
-bool blk_dev_is_medium_locked(BlockBackend *blk);
-
-void bdrv_set_dirty(BlockDriverState *bs, int64_t offset, int64_t bytes);
-
-void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out);
-void bdrv_restore_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap *backup);
-bool bdrv_dirty_bitmap_merge_internal(BdrvDirtyBitmap *dest,
-                                      const BdrvDirtyBitmap *src,
-                                      HBitmap **backup, bool lock);
-
-void bdrv_inc_in_flight(BlockDriverState *bs);
-void bdrv_dec_in_flight(BlockDriverState *bs);
-
-void blockdev_close_all_bdrv_states(void);
-
-int coroutine_fn bdrv_co_copy_range_from(BdrvChild *src, int64_t src_offset,
-                                         BdrvChild *dst, int64_t dst_offset,
-                                         int64_t bytes,
-                                         BdrvRequestFlags read_flags,
-                                         BdrvRequestFlags write_flags);
-int coroutine_fn bdrv_co_copy_range_to(BdrvChild *src, int64_t src_offset,
-                                       BdrvChild *dst, int64_t dst_offset,
-                                       int64_t bytes,
-                                       BdrvRequestFlags read_flags,
-                                       BdrvRequestFlags write_flags);
-
-int refresh_total_sectors(BlockDriverState *bs, int64_t hint);
-
-void bdrv_set_monitor_owned(BlockDriverState *bs);
-BlockDriverState *bds_tree_init(QDict *bs_opts, Error **errp);
-
-/**
- * Simple implementation of bdrv_co_create_opts for protocol drivers
- * which only support creation via opening a file
- * (usually existing raw storage device)
- */
-int coroutine_fn bdrv_co_create_opts_simple(BlockDriver *drv,
-                                            const char *filename,
-                                            QemuOpts *opts,
-                                            Error **errp);
-extern QemuOptsList bdrv_create_opts_simple;
-
-BdrvDirtyBitmap *block_dirty_bitmap_lookup(const char *node,
-                                           const char *name,
-                                           BlockDriverState **pbs,
-                                           Error **errp);
-BdrvDirtyBitmap *block_dirty_bitmap_merge(const char *node, const char *target,
-                                          BlockDirtyBitmapMergeSourceList *bms,
-                                          HBitmap **backup, Error **errp);
-BdrvDirtyBitmap *block_dirty_bitmap_remove(const char *node, const char *name,
-                                           bool release,
-                                           BlockDriverState **bitmap_bs,
-                                           Error **errp);
-
-BdrvChild *bdrv_cow_child(BlockDriverState *bs);
-BdrvChild *bdrv_filter_child(BlockDriverState *bs);
-BdrvChild *bdrv_filter_or_cow_child(BlockDriverState *bs);
-BdrvChild *bdrv_primary_child(BlockDriverState *bs);
-BlockDriverState *bdrv_skip_implicit_filters(BlockDriverState *bs);
-BlockDriverState *bdrv_skip_filters(BlockDriverState *bs);
-BlockDriverState *bdrv_backing_chain_next(BlockDriverState *bs);
-
-static inline BlockDriverState *child_bs(BdrvChild *child)
-{
-    return child ? child->bs : NULL;
-}
-
-static inline BlockDriverState *bdrv_cow_bs(BlockDriverState *bs)
-{
-    return child_bs(bdrv_cow_child(bs));
-}
-
-static inline BlockDriverState *bdrv_filter_bs(BlockDriverState *bs)
-{
-    return child_bs(bdrv_filter_child(bs));
-}
-
-static inline BlockDriverState *bdrv_filter_or_cow_bs(BlockDriverState *bs)
-{
-    return child_bs(bdrv_filter_or_cow_child(bs));
-}
-
-static inline BlockDriverState *bdrv_primary_bs(BlockDriverState *bs)
-{
-    return child_bs(bdrv_primary_child(bs));
-}
-
-/**
- * End all quiescent sections started by bdrv_drain_all_begin(). This is
- * needed when deleting a BDS before bdrv_drain_all_end() is called.
- *
- * NOTE: this is an internal helper for bdrv_close() *only*. No one else
- * should call it.
- */
-void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
-
-/**
- * Check whether the given offset is in the cached block-status data
- * region.
- *
- * If it is, and @pnum is not NULL, *pnum is set to
- * `bsc.data_end - offset`, i.e. how many bytes, starting from
- * @offset, are data (according to the cache).
- * Otherwise, *pnum is not touched.
- */
-bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum);
-
-/**
- * If [offset, offset + bytes) overlaps with the currently cached
- * block-status region, invalidate the cache.
- *
- * (To be used by I/O paths that cause data regions to be zero or
- * holes.)
- */
-void bdrv_bsc_invalidate_range(BlockDriverState *bs,
-                               int64_t offset, int64_t bytes);
-
-/**
- * Mark the range [offset, offset + bytes) as a data region.
- */
-void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes);
+/* DO NOT ADD ANYTHING IN HERE. USE ONE OF THE HEADERS INCLUDED ABOVE */
 
 #endif /* BLOCK_INT_H */
-- 
2.27.0




* [PATCH v4 07/25] assertions for block_int global state API
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (5 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 06/25] include/block/block_int: split header into I/O and global state API Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-12 13:51   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 08/25] block: introduce assert_bdrv_graph_writable Emanuele Giuseppe Esposito
                   ` (20 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block.c                         | 17 +++++++++++++++++
 block/backup.c                  |  1 +
 block/block-backend.c           |  3 +++
 block/commit.c                  |  2 ++
 block/dirty-bitmap.c            |  1 +
 block/io.c                      |  6 ++++++
 block/mirror.c                  |  4 ++++
 block/monitor/bitmap-qmp-cmds.c |  6 ++++++
 block/stream.c                  |  2 ++
 blockdev.c                      |  7 +++++++
 10 files changed, 49 insertions(+)

diff --git a/block.c b/block.c
index 672f946065..41c5883c5c 100644
--- a/block.c
+++ b/block.c
@@ -653,6 +653,8 @@ int coroutine_fn bdrv_co_create_opts_simple(BlockDriver *drv,
     Error *local_err = NULL;
     int ret;
 
+    assert(qemu_in_main_thread());
+
     size = qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0);
     buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
     prealloc = qapi_enum_parse(&PreallocMode_lookup, buf,
@@ -2428,6 +2430,8 @@ void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
     uint64_t cumulative_perms = 0;
     uint64_t cumulative_shared_perms = BLK_PERM_ALL;
 
+    assert(qemu_in_main_thread());
+
     QLIST_FOREACH(c, &bs->parents, next_parent) {
         cumulative_perms |= c->perm;
         cumulative_shared_perms &= c->shared_perm;
@@ -2486,6 +2490,8 @@ int bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared,
     Transaction *tran = tran_new();
     int ret;
 
+    assert(qemu_in_main_thread());
+
     bdrv_child_set_perm(c, perm, shared, tran);
 
     ret = bdrv_refresh_perms(c->bs, &local_err);
@@ -2516,6 +2522,8 @@ int bdrv_child_refresh_perms(BlockDriverState *bs, BdrvChild *c, Error **errp)
     uint64_t parent_perms, parent_shared;
     uint64_t perms, shared;
 
+    assert(qemu_in_main_thread());
+
     bdrv_get_cumulative_perm(bs, &parent_perms, &parent_shared);
     bdrv_child_perm(bs, c->bs, c, c->role, NULL,
                     parent_perms, parent_shared, &perms, &shared);
@@ -2658,6 +2666,7 @@ void bdrv_default_perms(BlockDriverState *bs, BdrvChild *c,
                         uint64_t perm, uint64_t shared,
                         uint64_t *nperm, uint64_t *nshared)
 {
+    assert(qemu_in_main_thread());
     if (role & BDRV_CHILD_FILTERED) {
         assert(!(role & (BDRV_CHILD_DATA | BDRV_CHILD_METADATA |
                          BDRV_CHILD_COW)));
@@ -2984,6 +2993,8 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
     BdrvChild *child = NULL;
     Transaction *tran = tran_new();
 
+    assert(qemu_in_main_thread());
+
     ret = bdrv_attach_child_common(child_bs, child_name, child_class,
                                    child_role, perm, shared_perm, opaque,
                                    &child, tran, errp);
@@ -6027,6 +6038,8 @@ const char *bdrv_get_parent_name(const BlockDriverState *bs)
     BdrvChild *c;
     const char *name;
 
+    assert(qemu_in_main_thread());
+
     /* If multiple parents have a name, just pick the first one. */
     QLIST_FOREACH(c, &bs->parents, next_parent) {
         if (c->klass->get_name) {
@@ -7302,6 +7315,8 @@ bool bdrv_recurse_can_replace(BlockDriverState *bs,
 {
     BlockDriverState *filtered;
 
+    assert(qemu_in_main_thread());
+
     if (!bs || !bs->drv) {
         return false;
     }
@@ -7473,6 +7488,7 @@ static bool append_strong_runtime_options(QDict *d, BlockDriverState *bs)
  * would result in exactly bs->backing. */
 bool bdrv_backing_overridden(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     if (bs->backing) {
         return strcmp(bs->auto_backing_file,
                       bs->backing->bs->filename);
@@ -7861,6 +7877,7 @@ static BlockDriverState *bdrv_do_skip_filters(BlockDriverState *bs,
  */
 BlockDriverState *bdrv_skip_implicit_filters(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     return bdrv_do_skip_filters(bs, true);
 }
 
diff --git a/block/backup.c b/block/backup.c
index 21d5983779..c53041772c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -372,6 +372,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
 
     assert(bs);
     assert(target);
+    assert(qemu_in_main_thread());
 
     /* QMP interface protects us from these cases */
     assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL);
diff --git a/block/block-backend.c b/block/block-backend.c
index ed45576007..fa30bb88ea 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1087,6 +1087,7 @@ static void blk_root_change_media(BdrvChild *child, bool load)
  */
 bool blk_dev_has_removable_media(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return !blk->dev || (blk->dev_ops && blk->dev_ops->change_media_cb);
 }
 
@@ -1104,6 +1105,7 @@ bool blk_dev_has_tray(BlockBackend *blk)
  */
 void blk_dev_eject_request(BlockBackend *blk, bool force)
 {
+    assert(qemu_in_main_thread());
     if (blk->dev_ops && blk->dev_ops->eject_request_cb) {
         blk->dev_ops->eject_request_cb(blk->dev_opaque, force);
     }
@@ -1126,6 +1128,7 @@ bool blk_dev_is_tray_open(BlockBackend *blk)
  */
 bool blk_dev_is_medium_locked(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     if (blk->dev_ops && blk->dev_ops->is_medium_locked) {
         return blk->dev_ops->is_medium_locked(blk->dev_opaque);
     }
diff --git a/block/commit.c b/block/commit.c
index 45a414a19b..f639eb49c5 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -253,6 +253,8 @@ void commit_start(const char *job_id, BlockDriverState *bs,
     uint64_t base_perms, iter_shared_perms;
     int ret;
 
+    assert(qemu_in_main_thread());
+
     assert(top != bs);
     if (bdrv_skip_filters(top) == bdrv_skip_filters(base)) {
         error_setg(errp, "Invalid files for merge: top and base are the same");
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 0ef46163e3..49462ca121 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -673,6 +673,7 @@ void bdrv_restore_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap *backup)
 {
     HBitmap *tmp = bitmap->bitmap;
     assert(!bdrv_dirty_bitmap_readonly(bitmap));
+    assert(qemu_in_main_thread());
     bitmap->bitmap = backup;
     hbitmap_free(tmp);
 }
diff --git a/block/io.c b/block/io.c
index c5d7f8495e..f271ab3684 100644
--- a/block/io.c
+++ b/block/io.c
@@ -560,6 +560,7 @@ void bdrv_subtree_drained_end(BlockDriverState *bs)
 void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent)
 {
     int i;
+    assert(qemu_in_main_thread());
 
     for (i = 0; i < new_parent->recursive_quiesce_counter; i++) {
         bdrv_do_drained_begin(child->bs, true, child, false, true);
@@ -571,6 +572,8 @@ void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent)
     int drained_end_counter = 0;
     int i;
 
+    assert(qemu_in_main_thread());
+
     for (i = 0; i < old_parent->recursive_quiesce_counter; i++) {
         bdrv_do_drained_end(child->bs, true, child, false,
                             &drained_end_counter);
@@ -690,6 +693,7 @@ void bdrv_drain_all_end_quiesce(BlockDriverState *bs)
 {
     int drained_end_counter = 0;
 
+    assert(qemu_in_main_thread());
     g_assert(bs->quiesce_counter > 0);
     g_assert(!bs->refcnt);
 
@@ -3419,6 +3423,7 @@ int coroutine_fn bdrv_co_copy_range_from(BdrvChild *src, int64_t src_offset,
 {
     trace_bdrv_co_copy_range_from(src, src_offset, dst, dst_offset, bytes,
                                   read_flags, write_flags);
+    assert(qemu_in_main_thread());
     return bdrv_co_copy_range_internal(src, src_offset, dst, dst_offset,
                                        bytes, read_flags, write_flags, true);
 }
@@ -3435,6 +3440,7 @@ int coroutine_fn bdrv_co_copy_range_to(BdrvChild *src, int64_t src_offset,
 {
     trace_bdrv_co_copy_range_to(src, src_offset, dst, dst_offset, bytes,
                                 read_flags, write_flags);
+    assert(qemu_in_main_thread());
     return bdrv_co_copy_range_internal(src, src_offset, dst, dst_offset,
                                        bytes, read_flags, write_flags, false);
 }
diff --git a/block/mirror.c b/block/mirror.c
index efec2c7674..00089e519b 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1880,6 +1880,8 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
     bool is_none_mode;
     BlockDriverState *base;
 
+    assert(qemu_in_main_thread());
+
     if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) ||
         (mode == MIRROR_SYNC_MODE_BITMAP)) {
         error_setg(errp, "Sync mode '%s' not supported",
@@ -1905,6 +1907,8 @@ BlockJob *commit_active_start(const char *job_id, BlockDriverState *bs,
     bool base_read_only;
     BlockJob *job;
 
+    assert(qemu_in_main_thread());
+
     base_read_only = bdrv_is_read_only(base);
 
     if (base_read_only) {
diff --git a/block/monitor/bitmap-qmp-cmds.c b/block/monitor/bitmap-qmp-cmds.c
index 9f11deec64..8b8d30287a 100644
--- a/block/monitor/bitmap-qmp-cmds.c
+++ b/block/monitor/bitmap-qmp-cmds.c
@@ -56,6 +56,8 @@ BdrvDirtyBitmap *block_dirty_bitmap_lookup(const char *node,
     BlockDriverState *bs;
     BdrvDirtyBitmap *bitmap;
 
+    assert(qemu_in_main_thread());
+
     if (!node) {
         error_setg(errp, "Node cannot be NULL");
         return NULL;
@@ -155,6 +157,8 @@ BdrvDirtyBitmap *block_dirty_bitmap_remove(const char *node, const char *name,
     BdrvDirtyBitmap *bitmap;
     AioContext *aio_context;
 
+    assert(qemu_in_main_thread());
+
     bitmap = block_dirty_bitmap_lookup(node, name, &bs, errp);
     if (!bitmap || !bs) {
         return NULL;
@@ -261,6 +265,8 @@ BdrvDirtyBitmap *block_dirty_bitmap_merge(const char *node, const char *target,
     BlockDirtyBitmapMergeSourceList *lst;
     Error *local_err = NULL;
 
+    assert(qemu_in_main_thread());
+
     dst = block_dirty_bitmap_lookup(node, target, &bs, errp);
     if (!dst) {
         return NULL;
diff --git a/block/stream.c b/block/stream.c
index 97bee482dc..b85c3742b2 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -216,6 +216,8 @@ void stream_start(const char *job_id, BlockDriverState *bs,
     QDict *opts;
     int ret;
 
+    assert(qemu_in_main_thread());
+
     assert(!(base && bottom));
     assert(!(backing_file_str && bottom));
 
diff --git a/blockdev.c b/blockdev.c
index ddba382abd..c1f6171c6c 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -69,6 +69,7 @@ QTAILQ_HEAD(, BlockDriverState) monitor_bdrv_states =
 
 void bdrv_set_monitor_owned(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     QTAILQ_INSERT_TAIL(&monitor_bdrv_states, bs, monitor_list);
 }
 
@@ -661,6 +662,7 @@ BlockDriverState *bds_tree_init(QDict *bs_opts, Error **errp)
 {
     int bdrv_flags = 0;
 
+    assert(qemu_in_main_thread());
     /* bdrv_open() defaults to the values in bdrv_flags (for compatibility
      * with other callers) rather than what we want as the real defaults.
      * Apply the defaults here instead. */
@@ -679,6 +681,7 @@ void blockdev_close_all_bdrv_states(void)
 {
     BlockDriverState *bs, *next_bs;
 
+    assert(qemu_in_main_thread());
     QTAILQ_FOREACH_SAFE(bs, &monitor_bdrv_states, monitor_list, next_bs) {
         AioContext *ctx = bdrv_get_aio_context(bs);
 
@@ -2332,6 +2335,8 @@ void qmp_transaction(TransactionActionList *dev_list,
     BlkActionState *state, *next;
     Error *local_err = NULL;
 
+    assert(qemu_in_main_thread());
+
     QTAILQ_HEAD(, BlkActionState) snap_bdrv_states;
     QTAILQ_INIT(&snap_bdrv_states);
 
@@ -3625,6 +3630,8 @@ void qmp_blockdev_del(const char *node_name, Error **errp)
     AioContext *aio_context;
     BlockDriverState *bs;
 
+    assert(qemu_in_main_thread());
+
     bs = bdrv_find_node(node_name);
     if (!bs) {
         error_setg(errp, "Failed to find node with node-name='%s'", node_name);
-- 
2.27.0




* [PATCH v4 08/25] block: introduce assert_bdrv_graph_writable
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (6 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 07/25] assertions for block_int " Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-12 14:40   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 09/25] include/block/blockjob_int.h: split header into I/O and GS API Emanuele Giuseppe Esposito
                   ` (19 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

We want to be sure that the functions that modify the child and
parent lists of a BDS run under both the BQL and a drained section.

The BQL prevents concurrent writes from the GS API, while drains
protect against concurrent I/O.

TODO: drains are missing in some functions that use this assert,
so a complete assertion would currently fail. Because adding drains
requires additional discussion, they will be added in a future
series.
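
As a rough illustration of where the TODO is heading (this is not part
of the patch), the complete check could combine both conditions. The
sketch below is self-contained and uses hypothetical stand-ins:
SketchBDS replaces BlockDriverState, sketch_in_main_thread() replaces
qemu_in_main_thread(), and C11 atomic_load() replaces qatomic_read().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for BlockDriverState: only the counter named
 * in the TODO is modelled.
 */
typedef struct SketchBDS {
    atomic_int quiesce_counter;   /* > 0 while the node is drained */
} SketchBDS;

/*
 * Hypothetical stand-in for qemu_in_main_thread(). A real build would
 * ask whether the BQL is held instead of reading a flag.
 */
static bool sketch_in_main_thread_flag = true;

static bool sketch_in_main_thread(void)
{
    return sketch_in_main_thread_flag;
}

/*
 * The graph may be written only if both conditions hold: the BQL
 * excludes other GS writers, the drain excludes in-flight I/O.
 */
static bool sketch_graph_writable(SketchBDS *bs)
{
    return sketch_in_main_thread() &&
           atomic_load(&bs->quiesce_counter) > 0;
}

/* Assertion wrapper, mirroring the shape of assert_bdrv_graph_writable(). */
static void sketch_assert_graph_writable(SketchBDS *bs)
{
    assert(sketch_graph_writable(bs));
}
```

With this shape, a caller that holds the BQL but forgot to drain the
node would trip the assertion, which is exactly why the check cannot
be enabled until the missing drains are added.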

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block.c                                |  5 +++++
 block/io.c                             | 11 +++++++++++
 include/block/block_int-global-state.h | 10 +++++++++-
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 41c5883c5c..94bff5c757 100644
--- a/block.c
+++ b/block.c
@@ -2734,12 +2734,14 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
         if (child->klass->detach) {
             child->klass->detach(child);
         }
+        assert_bdrv_graph_writable(old_bs);
         QLIST_REMOVE(child, next_parent);
     }
 
     child->bs = new_bs;
 
     if (new_bs) {
+        assert_bdrv_graph_writable(new_bs);
         QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
 
         /*
@@ -2940,6 +2942,7 @@ static int bdrv_attach_child_noperm(BlockDriverState *parent_bs,
         return ret;
     }
 
+    assert_bdrv_graph_writable(parent_bs);
     QLIST_INSERT_HEAD(&parent_bs->children, *child, next);
     /*
      * child is removed in bdrv_attach_child_common_abort(), so don't care to
@@ -3140,6 +3143,7 @@ static void bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child,
 void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child)
 {
     assert(qemu_in_main_thread());
+    assert_bdrv_graph_writable(parent);
     if (child == NULL) {
         return;
     }
@@ -4903,6 +4907,7 @@ static void bdrv_remove_filter_or_cow_child_abort(void *opaque)
     BdrvRemoveFilterOrCowChild *s = opaque;
     BlockDriverState *parent_bs = s->child->opaque;
 
+    assert_bdrv_graph_writable(parent_bs);
     QLIST_INSERT_HEAD(&parent_bs->children, s->child, next);
     if (s->is_backing) {
         parent_bs->backing = s->child;
diff --git a/block/io.c b/block/io.c
index f271ab3684..1c71e354d6 100644
--- a/block/io.c
+++ b/block/io.c
@@ -740,6 +740,17 @@ void bdrv_drain_all(void)
     bdrv_drain_all_end();
 }
 
+void assert_bdrv_graph_writable(BlockDriverState *bs)
+{
+    /*
+     * TODO: this function is incomplete. Because the users of this
+     * assert lack the necessary drains, check only for BQL.
+     * Once the necessary drains are added,
+     * assert also for qatomic_read(&bs->quiesce_counter) > 0
+     */
+    assert(qemu_in_main_thread());
+}
+
 /**
  * Remove an active request from the tracked requests list
  *
diff --git a/include/block/block_int-global-state.h b/include/block/block_int-global-state.h
index d08e80222c..6bd7746409 100644
--- a/include/block/block_int-global-state.h
+++ b/include/block/block_int-global-state.h
@@ -316,4 +316,12 @@ void bdrv_remove_aio_context_notifier(BlockDriverState *bs,
  */
 void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
 
-#endif /* BLOCK_INT_GLOBAL_STATE*/
+/**
+ * Make sure that the function is running under both
+ * drain and BQL. The latter protects from concurrent writes
+ * from the GS API, while the former prevents concurrent reads
+ * from I/O.
+ */
+void assert_bdrv_graph_writable(BlockDriverState *bs);
+
+#endif /* BLOCK_INT_GLOBAL_STATE */
-- 
2.27.0




* [PATCH v4 09/25] include/block/blockjob_int.h: split header into I/O and GS API
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (7 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 08/25] block: introduce assert_bdrv_graph_writable Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-10-25 10:17 ` [PATCH v4 10/25] assertions for blockjob_int.h Emanuele Giuseppe Esposito
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Since there are only a few I/O functions, keep everything in a single
file. Also split the function pointers in BlockJobDriver into I/O and
GS groups.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/blockjob_int.h | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/include/block/blockjob_int.h b/include/block/blockjob_int.h
index 6633d83da2..718d7b92d2 100644
--- a/include/block/blockjob_int.h
+++ b/include/block/blockjob_int.h
@@ -38,6 +38,13 @@ struct BlockJobDriver {
     /** Generic JobDriver callbacks and settings */
     JobDriver job_driver;
 
+    /*
+     * I/O API functions. These functions are thread-safe.
+     *
+     * See include/block/block-io.h for more information about
+     * the I/O API.
+     */
+
     /*
      * Returns whether the job has pending requests for the child or will
      * submit new requests before the next pause point. This callback is polled
@@ -46,6 +53,13 @@ struct BlockJobDriver {
      */
     bool (*drained_poll)(BlockJob *job);
 
+    /*
+     * Global state (GS) API. These functions run under the BQL lock.
+     *
+     * See include/block/block-global-state.h for more information about
+     * the GS API.
+     */
+
     /*
      * If the callback is not NULL, it will be invoked before the job is
      * resumed in a new AioContext.  This is the place to move any resources
@@ -56,6 +70,13 @@ struct BlockJobDriver {
     void (*set_speed)(BlockJob *job, int64_t speed);
 };
 
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * See include/block/block-global-state.h for more information about
+ * the GS API.
+ */
+
 /**
  * block_job_create:
  * @job_id: The id of the newly-created job, or %NULL to have one
@@ -98,6 +119,13 @@ void block_job_free(Job *job);
  */
 void block_job_user_resume(Job *job);
 
+/*
+ * I/O API functions. These functions are thread-safe.
+ *
+ * See include/block/block-io.h for more information about
+ * the I/O API.
+ */
+
 /**
  * block_job_ratelimit_get_delay:
  *
-- 
2.27.0




* [PATCH v4 10/25] assertions for blockjob_int.h
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (8 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 09/25] include/block/blockjob_int.h: split header into I/O and GS API Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-12 15:17   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 11/25] include/block/blockjob.h: global state API Emanuele Giuseppe Esposito
                   ` (17 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 blockjob.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/blockjob.c b/blockjob.c
index 4bad1408cb..fbd6c7d873 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -83,6 +83,7 @@ BlockJob *block_job_get(const char *id)
 
 void block_job_free(Job *job)
 {
+    assert(qemu_in_main_thread());
     BlockJob *bjob = container_of(job, BlockJob, job);
 
     block_job_remove_all_bdrv(bjob);
@@ -436,6 +437,8 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
     BlockBackend *blk;
     BlockJob *job;
 
+    assert(qemu_in_main_thread());
+
     if (job_id == NULL && !(flags & JOB_INTERNAL)) {
         job_id = bdrv_get_device_name(bs);
     }
@@ -504,6 +507,7 @@ void block_job_iostatus_reset(BlockJob *job)
 
 void block_job_user_resume(Job *job)
 {
+    assert(qemu_in_main_thread());
     BlockJob *bjob = container_of(job, BlockJob, job);
     block_job_iostatus_reset(bjob);
 }
-- 
2.27.0




* [PATCH v4 11/25] include/block/blockjob.h: global state API
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (9 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 10/25] assertions for blockjob_int.h Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-10-25 10:17 ` [PATCH v4 12/25] assertions for blockjob.h " Emanuele Giuseppe Esposito
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

blockjob functions always run under the BQL.
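
For illustration only, the convention that the follow-up assertion
patch enforces looks roughly like this: a GS function checks for the
main thread before touching any state. The names below are
hypothetical stand-ins (SketchJob for BlockJob, sketch_in_main_thread()
for qemu_in_main_thread()), so that the snippet compiles outside the
QEMU tree.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for qemu_in_main_thread(). */
static bool sketch_holds_bql = true;

static bool sketch_in_main_thread(void)
{
    return sketch_holds_bql;
}

/* Hypothetical status values, loosely modelled on the iostatus field. */
enum {
    SKETCH_IO_STATUS_OK = 0,
    SKETCH_IO_STATUS_FAILED = 1,
};

/* Hypothetical stand-in for BlockJob: only the status is modelled. */
typedef struct SketchJob {
    int iostatus;
} SketchJob;

/*
 * GS-style function: the assertion documents and enforces that the
 * caller runs in the main thread before any state is modified.
 */
static void sketch_job_iostatus_reset(SketchJob *job)
{
    assert(sketch_in_main_thread());

    if (job->iostatus == SKETCH_IO_STATUS_OK) {
        return;
    }
    job->iostatus = SKETCH_IO_STATUS_OK;
}
```

Putting the assertion first means a call from an iothread fails
immediately and visibly instead of racing on the job state.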

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/blockjob.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index d200f33c10..fa0c3f7a47 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -77,6 +77,13 @@ typedef struct BlockJob {
     GSList *nodes;
 } BlockJob;
 
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * See include/block/block-global-state.h for more information about
+ * the GS API.
+ */
+
 /**
  * block_job_next:
  * @job: A block job, or %NULL.
@@ -158,6 +165,8 @@ BlockJobInfo *block_job_query(BlockJob *job, Error **errp);
  */
 void block_job_iostatus_reset(BlockJob *job);
 
+/* Common functions that are neither I/O nor Global State */
+
 /**
  * block_job_is_internal:
  * @job: The job to determine if it is user-visible or not.
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v4 12/25] assertions for blockjob.h global state API
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (10 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 11/25] include/block/blockjob.h: global state API Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-12 15:26   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 13/25] include/sysemu/blockdev.h: move drive_add and inline drive_def Emanuele Giuseppe Esposito
                   ` (15 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 blockjob.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/blockjob.c b/blockjob.c
index fbd6c7d873..4982f6a2b5 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -61,6 +61,7 @@ static bool is_block_job(Job *job)
 
 BlockJob *block_job_next(BlockJob *bjob)
 {
+    assert(qemu_in_main_thread());
     Job *job = bjob ? &bjob->job : NULL;
 
     do {
@@ -72,6 +73,7 @@ BlockJob *block_job_next(BlockJob *bjob)
 
 BlockJob *block_job_get(const char *id)
 {
+    assert(qemu_in_main_thread());
     Job *job = job_get(id);
 
     if (job && is_block_job(job)) {
@@ -185,6 +187,7 @@ static const BdrvChildClass child_job = {
 
 void block_job_remove_all_bdrv(BlockJob *job)
 {
+    assert(qemu_in_main_thread());
     /*
      * bdrv_root_unref_child() may reach child_job_[can_]set_aio_ctx(),
      * which will also traverse job->nodes, so consume the list one by
@@ -207,6 +210,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
 bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs)
 {
     GSList *el;
+    assert(qemu_in_main_thread());
 
     for (el = job->nodes; el; el = el->next) {
         BdrvChild *c = el->data;
@@ -223,6 +227,7 @@ int block_job_add_bdrv(BlockJob *job, const char *name, BlockDriverState *bs,
 {
     BdrvChild *c;
     bool need_context_ops;
+    assert(qemu_in_main_thread());
 
     bdrv_ref(bs);
 
@@ -272,6 +277,8 @@ bool block_job_set_speed(BlockJob *job, int64_t speed, Error **errp)
     const BlockJobDriver *drv = block_job_driver(job);
     int64_t old_speed = job->speed;
 
+    assert(qemu_in_main_thread());
+
     if (job_apply_verb(&job->job, JOB_VERB_SET_SPEED, errp) < 0) {
         return false;
     }
@@ -309,6 +316,8 @@ BlockJobInfo *block_job_query(BlockJob *job, Error **errp)
     BlockJobInfo *info;
     uint64_t progress_current, progress_total;
 
+    assert(qemu_in_main_thread());
+
     if (block_job_is_internal(job)) {
         error_setg(errp, "Cannot query QEMU internal jobs");
         return NULL;
@@ -498,6 +507,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
 
 void block_job_iostatus_reset(BlockJob *job)
 {
+    assert(qemu_in_main_thread());
     if (job->iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
         return;
     }
-- 
2.27.0




* [PATCH v4 13/25] include/sysemu/blockdev.h: move drive_add and inline drive_def
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (11 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 12/25] assertions for blockjob.h " Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-12 15:41   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 14/25] include/sysemu/blockdev.h: global state API Emanuele Giuseppe Esposito
                   ` (14 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

drive_add is only used in softmmu/vl.c, so it can be a static
function there, and drive_def is only a particular use case of
qemu_opts_parse_noisily, so it can be inlined.

Also remove drive_mark_claimed_by_board, as it is only declared
but not implemented (nor used) anywhere.

This also helps simplify the next patch.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/monitor/block-hmp-cmds.c |  2 +-
 blockdev.c                     | 27 +--------------------------
 include/sysemu/blockdev.h      |  6 ++----
 softmmu/vl.c                   | 25 ++++++++++++++++++++++++-
 4 files changed, 28 insertions(+), 32 deletions(-)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index 2ac4aedfff..bfb3c043a0 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -101,7 +101,7 @@ void hmp_drive_add(Monitor *mon, const QDict *qdict)
         return;
     }
 
-    opts = drive_def(optstr);
+    opts = qemu_opts_parse_noisily(qemu_find_opts("drive"), optstr, false);
     if (!opts)
         return;
 
diff --git a/blockdev.c b/blockdev.c
index c1f6171c6c..1bf49ef610 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -73,7 +73,7 @@ void bdrv_set_monitor_owned(BlockDriverState *bs)
     QTAILQ_INSERT_TAIL(&monitor_bdrv_states, bs, monitor_list);
 }
 
-static const char *const if_name[IF_COUNT] = {
+const char *const if_name[IF_COUNT] = {
     [IF_NONE] = "none",
     [IF_IDE] = "ide",
     [IF_SCSI] = "scsi",
@@ -199,31 +199,6 @@ static int drive_index_to_unit_id(BlockInterfaceType type, int index)
     return max_devs ? index % max_devs : index;
 }
 
-QemuOpts *drive_def(const char *optstr)
-{
-    return qemu_opts_parse_noisily(qemu_find_opts("drive"), optstr, false);
-}
-
-QemuOpts *drive_add(BlockInterfaceType type, int index, const char *file,
-                    const char *optstr)
-{
-    QemuOpts *opts;
-
-    opts = drive_def(optstr);
-    if (!opts) {
-        return NULL;
-    }
-    if (type != IF_DEFAULT) {
-        qemu_opt_set(opts, "if", if_name[type], &error_abort);
-    }
-    if (index >= 0) {
-        qemu_opt_set_number(opts, "index", index, &error_abort);
-    }
-    if (file)
-        qemu_opt_set(opts, "file", file, &error_abort);
-    return opts;
-}
-
 DriveInfo *drive_get(BlockInterfaceType type, int bus, int unit)
 {
     BlockBackend *blk;
diff --git a/include/sysemu/blockdev.h b/include/sysemu/blockdev.h
index 32c2d6023c..960b54d320 100644
--- a/include/sysemu/blockdev.h
+++ b/include/sysemu/blockdev.h
@@ -27,6 +27,8 @@ typedef enum {
     IF_COUNT
 } BlockInterfaceType;
 
+extern const char *const if_name[];
+
 struct DriveInfo {
     BlockInterfaceType type;
     int bus;
@@ -45,16 +47,12 @@ BlockBackend *blk_by_legacy_dinfo(DriveInfo *dinfo);
 void override_max_devs(BlockInterfaceType type, int max_devs);
 
 DriveInfo *drive_get(BlockInterfaceType type, int bus, int unit);
-void drive_mark_claimed_by_board(void);
 void drive_check_orphaned(void);
 DriveInfo *drive_get_by_index(BlockInterfaceType type, int index);
 int drive_get_max_bus(BlockInterfaceType type);
 int drive_get_max_devs(BlockInterfaceType type);
 DriveInfo *drive_get_next(BlockInterfaceType type);
 
-QemuOpts *drive_def(const char *optstr);
-QemuOpts *drive_add(BlockInterfaceType type, int index, const char *file,
-                    const char *optstr);
 DriveInfo *drive_new(QemuOpts *arg, BlockInterfaceType block_default_type,
                      Error **errp);
 
diff --git a/softmmu/vl.c b/softmmu/vl.c
index af0c4cbd99..6938798bdb 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -650,6 +650,27 @@ static int drive_enable_snapshot(void *opaque, QemuOpts *opts, Error **errp)
     return 0;
 }
 
+static QemuOpts *drive_add(BlockInterfaceType type, int index,
+                           const char *file, const char *optstr)
+{
+    QemuOpts *opts;
+
+    opts = qemu_opts_parse_noisily(qemu_find_opts("drive"), optstr, false);
+    if (!opts) {
+        return NULL;
+    }
+    if (type != IF_DEFAULT) {
+        qemu_opt_set(opts, "if", if_name[type], &error_abort);
+    }
+    if (index >= 0) {
+        qemu_opt_set_number(opts, "index", index, &error_abort);
+    }
+    if (file) {
+        qemu_opt_set(opts, "file", file, &error_abort);
+    }
+    return opts;
+}
+
 static void default_drive(int enable, int snapshot, BlockInterfaceType type,
                           int index, const char *optstr)
 {
@@ -2884,7 +2905,9 @@ void qemu_init(int argc, char **argv, char **envp)
                     break;
                 }
             case QEMU_OPTION_drive:
-                if (drive_def(optarg) == NULL) {
+                opts = qemu_opts_parse_noisily(qemu_find_opts("drive"),
+                                               optarg, false);
+                if (opts == NULL) {
                     exit(1);
                 }
                 break;
-- 
2.27.0




* [PATCH v4 14/25] include/sysemu/blockdev.h: global state API
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (12 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 13/25] include/sysemu/blockdev.h: move drive_add and inline drive_def Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-10-28 15:48   ` Stefan Hajnoczi
  2021-11-12 15:46   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 15/25] assertions for blockdev.h " Emanuele Giuseppe Esposito
                   ` (13 subsequent siblings)
  27 siblings, 2 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

blockdev functions always run under the BQL.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 include/sysemu/blockdev.h | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/sysemu/blockdev.h b/include/sysemu/blockdev.h
index 960b54d320..b07f15df09 100644
--- a/include/sysemu/blockdev.h
+++ b/include/sysemu/blockdev.h
@@ -13,9 +13,6 @@
 #include "block/block.h"
 #include "qemu/queue.h"
 
-void blockdev_mark_auto_del(BlockBackend *blk);
-void blockdev_auto_del(BlockBackend *blk);
-
 typedef enum {
     IF_DEFAULT = -1,            /* for use with drive_add() only */
     /*
@@ -40,6 +37,16 @@ struct DriveInfo {
     QTAILQ_ENTRY(DriveInfo) next;
 };
 
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * See include/block/block-global-state.h for more information about
+ * the GS API.
+ */
+
+void blockdev_mark_auto_del(BlockBackend *blk);
+void blockdev_auto_del(BlockBackend *blk);
+
 DriveInfo *blk_legacy_dinfo(BlockBackend *blk);
 DriveInfo *blk_set_legacy_dinfo(BlockBackend *blk, DriveInfo *dinfo);
 BlockBackend *blk_by_legacy_dinfo(DriveInfo *dinfo);
@@ -50,10 +57,13 @@ DriveInfo *drive_get(BlockInterfaceType type, int bus, int unit);
 void drive_check_orphaned(void);
 DriveInfo *drive_get_by_index(BlockInterfaceType type, int index);
 int drive_get_max_bus(BlockInterfaceType type);
-int drive_get_max_devs(BlockInterfaceType type);
 DriveInfo *drive_get_next(BlockInterfaceType type);
 
 DriveInfo *drive_new(QemuOpts *arg, BlockInterfaceType block_default_type,
                      Error **errp);
 
+/* Common functions that are neither I/O nor Global State */
+
+int drive_get_max_devs(BlockInterfaceType type);
+
 #endif
-- 
2.27.0




* [PATCH v4 15/25] assertions for blockdev.h global state API
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (13 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 14/25] include/sysemu/blockdev.h: global state API Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-10-25 10:17 ` [PATCH v4 16/25] include/block/snapshot: global state API + assertions Emanuele Giuseppe Esposito
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/block-backend.c |  3 +++
 blockdev.c            | 15 +++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index fa30bb88ea..e5e4f4e9b9 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -795,6 +795,7 @@ bool bdrv_is_root_node(BlockDriverState *bs)
  */
 DriveInfo *blk_legacy_dinfo(BlockBackend *blk)
 {
+    assert(qemu_in_main_thread());
     return blk->legacy_dinfo;
 }
 
@@ -806,6 +807,7 @@ DriveInfo *blk_legacy_dinfo(BlockBackend *blk)
 DriveInfo *blk_set_legacy_dinfo(BlockBackend *blk, DriveInfo *dinfo)
 {
     assert(!blk->legacy_dinfo);
+    assert(qemu_in_main_thread());
     return blk->legacy_dinfo = dinfo;
 }
 
@@ -816,6 +818,7 @@ DriveInfo *blk_set_legacy_dinfo(BlockBackend *blk, DriveInfo *dinfo)
 BlockBackend *blk_by_legacy_dinfo(DriveInfo *dinfo)
 {
     BlockBackend *blk = NULL;
+    assert(qemu_in_main_thread());
 
     while ((blk = blk_next(blk)) != NULL) {
         if (blk->legacy_dinfo == dinfo) {
diff --git a/blockdev.c b/blockdev.c
index 1bf49ef610..d9bf842a81 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -113,6 +113,8 @@ void override_max_devs(BlockInterfaceType type, int max_devs)
     BlockBackend *blk;
     DriveInfo *dinfo;
 
+    assert(qemu_in_main_thread());
+
     if (max_devs <= 0) {
         return;
     }
@@ -142,6 +144,8 @@ void blockdev_mark_auto_del(BlockBackend *blk)
     DriveInfo *dinfo = blk_legacy_dinfo(blk);
     BlockJob *job;
 
+    assert(qemu_in_main_thread());
+
     if (!dinfo) {
         return;
     }
@@ -163,6 +167,7 @@ void blockdev_mark_auto_del(BlockBackend *blk)
 void blockdev_auto_del(BlockBackend *blk)
 {
     DriveInfo *dinfo = blk_legacy_dinfo(blk);
+    assert(qemu_in_main_thread());
 
     if (dinfo && dinfo->auto_del) {
         monitor_remove_blk(blk);
@@ -204,6 +209,8 @@ DriveInfo *drive_get(BlockInterfaceType type, int bus, int unit)
     BlockBackend *blk;
     DriveInfo *dinfo;
 
+    assert(qemu_in_main_thread());
+
     for (blk = blk_next(NULL); blk; blk = blk_next(blk)) {
         dinfo = blk_legacy_dinfo(blk);
         if (dinfo && dinfo->type == type
@@ -226,6 +233,8 @@ void drive_check_orphaned(void)
     Location loc;
     bool orphans = false;
 
+    assert(qemu_in_main_thread());
+
     for (blk = blk_next(NULL); blk; blk = blk_next(blk)) {
         dinfo = blk_legacy_dinfo(blk);
         /*
@@ -259,6 +268,7 @@ void drive_check_orphaned(void)
 
 DriveInfo *drive_get_by_index(BlockInterfaceType type, int index)
 {
+    assert(qemu_in_main_thread());
     return drive_get(type,
                      drive_index_to_bus_id(type, index),
                      drive_index_to_unit_id(type, index));
@@ -270,6 +280,8 @@ int drive_get_max_bus(BlockInterfaceType type)
     BlockBackend *blk;
     DriveInfo *dinfo;
 
+    assert(qemu_in_main_thread());
+
     max_bus = -1;
     for (blk = blk_next(NULL); blk; blk = blk_next(blk)) {
         dinfo = blk_legacy_dinfo(blk);
@@ -286,6 +298,7 @@ int drive_get_max_bus(BlockInterfaceType type)
 DriveInfo *drive_get_next(BlockInterfaceType type)
 {
     static int next_block_unit[IF_COUNT];
+    assert(qemu_in_main_thread());
 
     return drive_get(type, 0, next_block_unit[type]++);
 }
@@ -766,6 +779,8 @@ DriveInfo *drive_new(QemuOpts *all_opts, BlockInterfaceType block_default_type,
     const char *filename;
     int i;
 
+    assert(qemu_in_main_thread());
+
     /* Change legacy command line options into QMP ones */
     static const struct {
         const char *from;
-- 
2.27.0




* [PATCH v4 16/25] include/block/snapshot: global state API + assertions
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (14 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 15/25] assertions for blockdev.h " Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-10-25 10:17 ` [PATCH v4 17/25] block/copy-before-write.h: " Emanuele Giuseppe Esposito
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Snapshot functions also run under the BQL, so they all belong to
the global state API. The AioContext lock that they hold is currently
overkill and could be removed in the future.
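
As a rough illustration of the assertion pattern (not QEMU code), the
sketch below assumes a hypothetical per-thread flag, big_lock_held,
standing in for "this thread holds the BQL", and a hypothetical
in_global_state_context() playing the role that qemu_in_main_thread()
plays in this patch. The snapshot_table type and its function are made
up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the BQL state: a per-thread flag recording
 * whether the current thread holds the big lock. The real locking is
 * elided; only the bookkeeping needed for the assertion is modelled.
 */
static _Thread_local bool big_lock_held;

static void big_lock_acquire(void)
{
    big_lock_held = true;
}

static void big_lock_release(void)
{
    big_lock_held = false;
}

/* Hypothetical counterpart of qemu_in_main_thread(). */
static bool in_global_state_context(void)
{
    return big_lock_held;
}

/* Made-up global state object. */
struct snapshot_table {
    int count;
};

/*
 * Global state style function: it checks its calling context before
 * touching shared state, as the assertions added by this patch do.
 */
static int snapshot_table_add(struct snapshot_table *t)
{
    assert(in_global_state_context());
    return ++t->count;
}
```

A caller that forgets to take the lock trips the assertion right away,
so the check both documents and enforces where the function may run.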

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/snapshot.c         | 28 ++++++++++++++++++++++++++++
 include/block/snapshot.h | 13 +++++++++++--
 migration/savevm.c       |  2 ++
 3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/block/snapshot.c b/block/snapshot.c
index ccacda8bd5..3dd3c9d3bd 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -57,6 +57,8 @@ int bdrv_snapshot_find(BlockDriverState *bs, QEMUSnapshotInfo *sn_info,
     QEMUSnapshotInfo *sn_tab, *sn;
     int nb_sns, i, ret;
 
+    assert(qemu_in_main_thread());
+
     ret = -ENOENT;
     nb_sns = bdrv_snapshot_list(bs, &sn_tab);
     if (nb_sns < 0) {
@@ -105,6 +107,7 @@ bool bdrv_snapshot_find_by_id_and_name(BlockDriverState *bs,
     bool ret = false;
 
     assert(id || name);
+    assert(qemu_in_main_thread());
 
     nb_sns = bdrv_snapshot_list(bs, &sn_tab);
     if (nb_sns < 0) {
@@ -200,6 +203,7 @@ static BlockDriverState *bdrv_snapshot_fallback(BlockDriverState *bs)
 int bdrv_can_snapshot(BlockDriverState *bs)
 {
     BlockDriver *drv = bs->drv;
+    assert(qemu_in_main_thread());
     if (!drv || !bdrv_is_inserted(bs) || bdrv_is_read_only(bs)) {
         return 0;
     }
@@ -220,6 +224,9 @@ int bdrv_snapshot_create(BlockDriverState *bs,
 {
     BlockDriver *drv = bs->drv;
     BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
+
+    assert(qemu_in_main_thread());
+
     if (!drv) {
         return -ENOMEDIUM;
     }
@@ -240,6 +247,8 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
     BdrvChild **fallback_ptr;
     int ret, open_ret;
 
+    assert(qemu_in_main_thread());
+
     if (!drv) {
         error_setg(errp, "Block driver is closed");
         return -ENOMEDIUM;
@@ -348,6 +357,8 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
     BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
     int ret;
 
+    assert(qemu_in_main_thread());
+
     if (!drv) {
         error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, bdrv_get_device_name(bs));
         return -ENOMEDIUM;
@@ -380,6 +391,8 @@ int bdrv_snapshot_list(BlockDriverState *bs,
 {
     BlockDriver *drv = bs->drv;
     BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
+
+    assert(qemu_in_main_thread());
     if (!drv) {
         return -ENOMEDIUM;
     }
@@ -419,6 +432,8 @@ int bdrv_snapshot_load_tmp(BlockDriverState *bs,
 {
     BlockDriver *drv = bs->drv;
 
+    assert(qemu_in_main_thread());
+
     if (!drv) {
         error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, bdrv_get_device_name(bs));
         return -ENOMEDIUM;
@@ -447,6 +462,8 @@ int bdrv_snapshot_load_tmp_by_id_or_name(BlockDriverState *bs,
     int ret;
     Error *local_err = NULL;
 
+    assert(qemu_in_main_thread());
+
     ret = bdrv_snapshot_load_tmp(bs, id_or_name, NULL, &local_err);
     if (ret == -ENOENT || ret == -EINVAL) {
         error_free(local_err);
@@ -515,6 +532,8 @@ bool bdrv_all_can_snapshot(bool has_devices, strList *devices,
     g_autoptr(GList) bdrvs = NULL;
     GList *iterbdrvs;
 
+    assert(qemu_in_main_thread());
+
     if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
         return false;
     }
@@ -549,6 +568,8 @@ int bdrv_all_delete_snapshot(const char *name,
     g_autoptr(GList) bdrvs = NULL;
     GList *iterbdrvs;
 
+    assert(qemu_in_main_thread());
+
     if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
         return -1;
     }
@@ -588,6 +609,8 @@ int bdrv_all_goto_snapshot(const char *name,
     g_autoptr(GList) bdrvs = NULL;
     GList *iterbdrvs;
 
+    assert(qemu_in_main_thread());
+
     if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
         return -1;
     }
@@ -622,6 +645,8 @@ int bdrv_all_has_snapshot(const char *name,
     g_autoptr(GList) bdrvs = NULL;
     GList *iterbdrvs;
 
+    assert(qemu_in_main_thread());
+
     if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
         return -1;
     }
@@ -663,6 +688,7 @@ int bdrv_all_create_snapshot(QEMUSnapshotInfo *sn,
 {
     g_autoptr(GList) bdrvs = NULL;
     GList *iterbdrvs;
+    assert(qemu_in_main_thread());
 
     if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
         return -1;
@@ -703,6 +729,8 @@ BlockDriverState *bdrv_all_find_vmstate_bs(const char *vmstate_bs,
     g_autoptr(GList) bdrvs = NULL;
     GList *iterbdrvs;
 
+    assert(qemu_in_main_thread());
+
     if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
         return NULL;
     }
diff --git a/include/block/snapshot.h b/include/block/snapshot.h
index 940345692f..cda82c1ba1 100644
--- a/include/block/snapshot.h
+++ b/include/block/snapshot.h
@@ -45,6 +45,13 @@ typedef struct QEMUSnapshotInfo {
     uint64_t icount; /* record/replay step */
 } QEMUSnapshotInfo;
 
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * See include/block/block-global-state.h for more information about
+ * the GS API.
+ */
+
 int bdrv_snapshot_find(BlockDriverState *bs, QEMUSnapshotInfo *sn_info,
                        const char *name);
 bool bdrv_snapshot_find_by_id_and_name(BlockDriverState *bs,
@@ -73,9 +80,11 @@ int bdrv_snapshot_load_tmp_by_id_or_name(BlockDriverState *bs,
                                          Error **errp);
 
 
-/* Group operations. All block drivers are involved.
+/*
+ * Group operations. All block drivers are involved.
  * These functions will properly handle dataplane (take aio_context_acquire
- * when appropriate for appropriate block drivers */
+ * when appropriate for appropriate block drivers
+ */
 
 bool bdrv_all_can_snapshot(bool has_devices, strList *devices,
                            Error **errp);
diff --git a/migration/savevm.c b/migration/savevm.c
index 7b7b64bd13..5dd740f8e7 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2785,6 +2785,8 @@ bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
     g_autoptr(GDateTime) now = g_date_time_new_now_local();
     AioContext *aio_context;
 
+    assert(qemu_in_main_thread());
+
     if (migration_is_blocked(errp)) {
         return false;
     }
-- 
2.27.0




* [PATCH v4 17/25] block/copy-before-write.h: global state API + assertions
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (15 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 16/25] include/block/snapshot: global state API + assertions Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-10-25 10:17 ` [PATCH v4 18/25] block/coroutines: I/O API Emanuele Giuseppe Esposito
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

copy-before-write functions always run under the BQL.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/copy-before-write.c | 2 ++
 block/copy-before-write.h | 7 +++++++
 2 files changed, 9 insertions(+)

diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index c30a5ff8de..36a8d7ba52 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -223,6 +223,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
     QDict *opts;
 
     assert(source->total_sectors == target->total_sectors);
+    assert(qemu_in_main_thread());
 
     opts = qdict_new();
     qdict_put_str(opts, "driver", "copy-before-write");
@@ -245,6 +246,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
 
 void bdrv_cbw_drop(BlockDriverState *bs)
 {
+    assert(qemu_in_main_thread());
     bdrv_drop_filter(bs, &error_abort);
     bdrv_unref(bs);
 }
diff --git a/block/copy-before-write.h b/block/copy-before-write.h
index 51847e711a..9a45de2fce 100644
--- a/block/copy-before-write.h
+++ b/block/copy-before-write.h
@@ -29,6 +29,13 @@
 #include "block/block_int.h"
 #include "block/block-copy.h"
 
+/*
+ * Global state (GS) API. These functions run under the BQL lock.
+ *
+ * See include/block/block-global-state.h for more information about
+ * the GS API.
+ */
+
 BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
                                   BlockDriverState *target,
                                   const char *filter_node_name,
-- 
2.27.0




* [PATCH v4 18/25] block/coroutines: I/O API
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (16 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 17/25] block/copy-before-write.h: " Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-10-25 10:17 ` [PATCH v4 19/25] block_int-common.h: split function pointers in BlockDriver Emanuele Giuseppe Esposito
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Block coroutine functions run in different AioContexts and are
not protected by the BQL. Therefore they belong to the I/O API.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/coroutines.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/block/coroutines.h b/block/coroutines.h
index c8c14a29c8..c61abd271a 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -29,6 +29,12 @@
 
 /* For blk_bs() in generated block/block-gen.c */
 #include "sysemu/block-backend.h"
+/*
+ * I/O API functions. These functions are thread-safe.
+ *
+ * See include/block/block-io.h for more information about
+ * the I/O API.
+ */
 
 int coroutine_fn bdrv_co_check(BlockDriverState *bs,
                                BdrvCheckResult *res, BdrvCheckMode fix);
-- 
2.27.0




* [PATCH v4 19/25] block_int-common.h: split function pointers in BlockDriver
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (17 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 18/25] block/coroutines: I/O API Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-15 12:00   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers Emanuele Giuseppe Esposito
                   ` (8 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

As with the header split, the function pointers in BlockDriver
can also be split into I/O and global state.
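
The grouping can be pictured with a much smaller table. This is a
sketch only: struct toy_driver, struct toy_state and every callback
name below are hypothetical and are not part of BlockDriver. It shows
constant fields first, then callbacks meant to run under the big lock,
then callbacks meant to be thread-safe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-image state used by the toy driver. */
struct toy_state {
    const unsigned char *data;
    size_t len;
    int opened;
};

/* Hypothetical driver table, grouped the way this patch groups BlockDriver. */
struct toy_driver {
    /* Fields initialized in the definition and never changed. */
    const char *format_name;

    /* Global state callbacks: expected to be called under the big lock. */
    int (*open)(struct toy_state *s);
    void (*close)(struct toy_state *s);

    /* I/O callbacks: expected to be thread-safe. */
    int (*read_byte)(struct toy_state *s, size_t offset);
};

static int toy_open(struct toy_state *s)
{
    s->opened = 1;
    return 0;
}

static void toy_close(struct toy_state *s)
{
    s->opened = 0;
}

/* Returns the byte at @offset, or -1 if the image is closed or too short. */
static int toy_read_byte(struct toy_state *s, size_t offset)
{
    if (!s->opened || offset >= s->len) {
        return -1;
    }
    return s->data[offset];
}

static const struct toy_driver toy_raw = {
    .format_name = "toy-raw",
    .open        = toy_open,
    .close       = toy_close,
    .read_byte   = toy_read_byte,
};
```

Keeping each group contiguous lets a reader see from a callback's
position in the struct which locking rules it is expected to follow.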

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/block_int-common.h | 458 ++++++++++++++++---------------
 1 file changed, 237 insertions(+), 221 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 79a3d801d2..9857e775fe 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -96,6 +96,7 @@ typedef struct BdrvTrackedRequest {
 
 
 struct BlockDriver {
+    /* Fields initialized in struct definition and never changed. */
     const char *format_name;
     int instance_size;
 
@@ -121,23 +122,7 @@ struct BlockDriver {
      * on those children.
      */
     bool is_format;
-    /*
-     * Return true if @to_replace can be replaced by a BDS with the
-     * same data as @bs without it affecting @bs's behavior (that is,
-     * without it being visible to @bs's parents).
-     */
-    bool (*bdrv_recurse_can_replace)(BlockDriverState *bs,
-                                     BlockDriverState *to_replace);
 
-    int (*bdrv_probe)(const uint8_t *buf, int buf_size, const char *filename);
-    int (*bdrv_probe_device)(const char *filename);
-
-    /*
-     * Any driver implementing this callback is expected to be able to handle
-     * NULL file names in its .bdrv_open() implementation.
-     */
-    void (*bdrv_parse_filename)(const char *filename, QDict *options,
-                                Error **errp);
     /*
      * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
      * this field set to true, except ones that are defined only by their
@@ -159,7 +144,66 @@ struct BlockDriver {
      */
     bool supports_backing;
 
-    /* For handling image reopen for split or non-split files */
+    /*
+     * Drivers setting this field must be able to work with just a plain
+     * filename with '<protocol_name>:' as a prefix, and no other options.
+     * Options may be extracted from the filename by implementing
+     * bdrv_parse_filename.
+     */
+    const char *protocol_name;
+
+    /* List of options for creating images, terminated by name == NULL */
+    QemuOptsList *create_opts;
+
+    /* List of options for image amend */
+    QemuOptsList *amend_opts;
+
+    /*
+     * If this driver supports reopening images this contains a
+     * NULL-terminated list of the runtime options that can be
+     * modified. If an option in this list is unspecified during
+     * reopen then it _must_ be reset to its default value or return
+     * an error.
+     */
+    const char *const *mutable_opts;
+
+    /*
+     * Pointer to a NULL-terminated array of names of strong options
+     * that can be specified for bdrv_open(). A strong option is one
+     * that changes the data of a BDS.
+     * If this pointer is NULL, the array is considered empty.
+     * "filename" and "driver" are always considered strong.
+     */
+    const char *const *strong_runtime_opts;
+
+    /*
+     * Global state (GS) API. These functions run under the BQL lock.
+     *
+     * See include/block/block-global-state.h for more information about
+     * the GS API.
+     */
+
+    /*
+     * Return true if @to_replace can be replaced by a BDS with the
+     * same data as @bs without it affecting @bs's behavior (that is,
+     * without it being visible to @bs's parents).
+     */
+    bool (*bdrv_recurse_can_replace)(BlockDriverState *bs,
+                                     BlockDriverState *to_replace);
+
+    int (*bdrv_probe)(const uint8_t *buf, int buf_size, const char *filename);
+    int (*bdrv_probe_device)(const char *filename);
+
+    /*
+     * Any driver implementing this callback is expected to be able to handle
+     * NULL file names in its .bdrv_open() implementation.
+     */
+    void (*bdrv_parse_filename)(const char *filename, QDict *options,
+                                Error **errp);
+
+    /*
+     * For handling image reopen for split or non-split files.
+     */
     int (*bdrv_reopen_prepare)(BDRVReopenState *reopen_state,
                                BlockReopenQueue *queue, Error **errp);
     void (*bdrv_reopen_commit)(BDRVReopenState *reopen_state);
@@ -175,19 +219,6 @@ struct BlockDriver {
                           Error **errp);
     void (*bdrv_close)(BlockDriverState *bs);
 
-
-    int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
-                                       Error **errp);
-    int coroutine_fn (*bdrv_co_create_opts)(BlockDriver *drv,
-                                            const char *filename,
-                                            QemuOpts *opts,
-                                            Error **errp);
-
-    int coroutine_fn (*bdrv_co_amend)(BlockDriverState *bs,
-                                      BlockdevAmendOptions *opts,
-                                      bool force,
-                                      Error **errp);
-
     int (*bdrv_amend_options)(BlockDriverState *bs,
                               QemuOpts *opts,
                               BlockDriverAmendStatusCB *status_cb,
@@ -234,6 +265,182 @@ struct BlockDriver {
      */
     char *(*bdrv_dirname)(BlockDriverState *bs, Error **errp);
 
+    /*
+     * This informs the driver that we are no longer interested in the result
+     * of in-flight requests, so don't waste the time if possible.
+     *
+     * One example usage is to avoid waiting for an nbd target node reconnect
+     * timeout during job-cancel with force=true.
+     */
+    void (*bdrv_cancel_in_flight)(BlockDriverState *bs);
+
+    int (*bdrv_inactivate)(BlockDriverState *bs);
+
+    int (*bdrv_snapshot_create)(BlockDriverState *bs,
+                                QEMUSnapshotInfo *sn_info);
+    int (*bdrv_snapshot_goto)(BlockDriverState *bs,
+                              const char *snapshot_id);
+    int (*bdrv_snapshot_delete)(BlockDriverState *bs,
+                                const char *snapshot_id,
+                                const char *name,
+                                Error **errp);
+    int (*bdrv_snapshot_list)(BlockDriverState *bs,
+                              QEMUSnapshotInfo **psn_info);
+    int (*bdrv_snapshot_load_tmp)(BlockDriverState *bs,
+                                  const char *snapshot_id,
+                                  const char *name,
+                                  Error **errp);
+
+    int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs,
+                                          QEMUIOVector *qiov,
+                                          int64_t pos);
+    int coroutine_fn (*bdrv_load_vmstate)(BlockDriverState *bs,
+                                          QEMUIOVector *qiov,
+                                          int64_t pos);
+
+    int (*bdrv_change_backing_file)(BlockDriverState *bs,
+        const char *backing_file, const char *backing_fmt);
+
+    void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
+    void (*bdrv_lock_medium)(BlockDriverState *bs, bool locked);
+
+    /* TODO Better pass a option string/QDict/QemuOpts to add any rule? */
+    int (*bdrv_debug_breakpoint)(BlockDriverState *bs, const char *event,
+        const char *tag);
+    int (*bdrv_debug_remove_breakpoint)(BlockDriverState *bs,
+        const char *tag);
+    int (*bdrv_debug_resume)(BlockDriverState *bs, const char *tag);
+    bool (*bdrv_debug_is_suspended)(BlockDriverState *bs, const char *tag);
+    void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);
+
+    /*
+     * Returns 1 if newly created images are guaranteed to contain only
+     * zeros, 0 otherwise.
+     */
+    int (*bdrv_has_zero_init)(BlockDriverState *bs);
+
+    /*
+     * Remove fd handlers, timers, and other event loop callbacks so the event
+     * loop is no longer in use.  Called with no in-flight requests and in
+     * depth-first traversal order with parents before child nodes.
+     */
+    void (*bdrv_detach_aio_context)(BlockDriverState *bs);
+
+    /*
+     * Add fd handlers, timers, and other event loop callbacks so I/O requests
+     * can be processed again.  Called with no in-flight requests and in
+     * depth-first traversal order with child nodes before parent nodes.
+     */
+    void (*bdrv_attach_aio_context)(BlockDriverState *bs,
+                                    AioContext *new_context);
+
+    /**
+     * Try to get @bs's logical and physical block size.
+     * On success, store them in @bsz and return zero.
+     * On failure, return negative errno.
+     */
+    int (*bdrv_probe_blocksizes)(BlockDriverState *bs, BlockSizes *bsz);
+    /**
+     * Try to get @bs's geometry (cyls, heads, sectors)
+     * On success, store them in @geo and return 0.
+     * On failure return -errno.
+     * Only drivers that want to override guest geometry implement this
+     * callback; see hd_geometry_guess().
+     */
+    int (*bdrv_probe_geometry)(BlockDriverState *bs, HDGeometry *geo);
+
+    void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child,
+                           Error **errp);
+    void (*bdrv_del_child)(BlockDriverState *parent, BdrvChild *child,
+                           Error **errp);
+
+    /**
+     * Informs the block driver that a permission change is intended. The
+     * driver checks whether the change is permissible and may take other
+     * preparations for the change (e.g. get file system locks). This operation
+     * is always followed either by a call to either .bdrv_set_perm or
+     * .bdrv_abort_perm_update.
+     *
+     * Checks whether the requested set of cumulative permissions in @perm
+     * can be granted for accessing @bs and whether no other users are using
+     * permissions other than those given in @shared (both arguments take
+     * BLK_PERM_* bitmasks).
+     *
+     * If both conditions are met, 0 is returned. Otherwise, -errno is returned
+     * and errp is set to an error describing the conflict.
+     */
+    int (*bdrv_check_perm)(BlockDriverState *bs, uint64_t perm,
+                           uint64_t shared, Error **errp);
+
+    /**
+     * Called to inform the driver that the set of cumulative set of used
+     * permissions for @bs has changed to @perm, and the set of sharable
+     * permission to @shared. The driver can use this to propagate changes to
+     * its children (i.e. request permissions only if a parent actually needs
+     * them).
+     *
+     * This function is only invoked after bdrv_check_perm(), so block drivers
+     * may rely on preparations made in their .bdrv_check_perm implementation.
+     */
+    void (*bdrv_set_perm)(BlockDriverState *bs, uint64_t perm, uint64_t shared);
+
+    /*
+     * Called to inform the driver that after a previous bdrv_check_perm()
+     * call, the permission update is not performed and any preparations made
+     * for it (e.g. taken file locks) need to be undone.
+     *
+     * This function can be called even for nodes that never saw a
+     * bdrv_check_perm() call. It is a no-op then.
+     */
+    void (*bdrv_abort_perm_update)(BlockDriverState *bs);
+
+    /**
+     * Returns in @nperm and @nshared the permissions that the driver for @bs
+     * needs on its child @c, based on the cumulative permissions requested by
+     * the parents in @parent_perm and @parent_shared.
+     *
+     * If @c is NULL, return the permissions for attaching a new child for the
+     * given @child_class and @role.
+     *
+     * If @reopen_queue is non-NULL, don't return the currently needed
+     * permissions, but those that will be needed after applying the
+     * @reopen_queue.
+     */
+     void (*bdrv_child_perm)(BlockDriverState *bs, BdrvChild *c,
+                             BdrvChildRole role,
+                             BlockReopenQueue *reopen_queue,
+                             uint64_t parent_perm, uint64_t parent_shared,
+                             uint64_t *nperm, uint64_t *nshared);
+
+    /**
+     * Register/unregister a buffer for I/O. For example, when the driver is
+     * interested to know the memory areas that will later be used in iovs, so
+     * that it can do IOMMU mapping with VFIO etc., in order to get better
+     * performance. In the case of VFIO drivers, this callback is used to do
+     * DMA mapping for hot buffers.
+     */
+    void (*bdrv_register_buf)(BlockDriverState *bs, void *host, size_t size);
+    void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host);
+    QLIST_ENTRY(BlockDriver) list;
+
+    /*
+     * I/O API functions. These functions are thread-safe.
+     *
+     * See include/block/block-io.h for more information about
+     * the I/O API.
+     */
+
+    int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
+                                       Error **errp);
+    int coroutine_fn (*bdrv_co_create_opts)(BlockDriver *drv,
+                                            const char *filename,
+                                            QemuOpts *opts,
+                                            Error **errp);
+    int coroutine_fn (*bdrv_co_amend)(BlockDriverState *bs,
+                                      BlockdevAmendOptions *opts,
+                                      bool force,
+                                      Error **errp);
+
     /* aio */
     BlockAIOCB *(*bdrv_aio_preadv)(BlockDriverState *bs,
         int64_t offset, int64_t bytes, QEMUIOVector *qiov,
@@ -374,21 +581,11 @@ struct BlockDriver {
         bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
         int64_t *map, BlockDriverState **file);
 
-    /*
-     * This informs the driver that we are no longer interested in the result
-     * of in-flight requests, so don't waste the time if possible.
-     *
-     * One example usage is to avoid waiting for an nbd target node reconnect
-     * timeout during job-cancel with force=true.
-     */
-    void (*bdrv_cancel_in_flight)(BlockDriverState *bs);
-
     /*
      * Invalidate any cached meta-data.
      */
     void coroutine_fn (*bdrv_co_invalidate_cache)(BlockDriverState *bs,
                                                   Error **errp);
-    int (*bdrv_inactivate)(BlockDriverState *bs);
 
     /*
      * Flushes all data for all layers by calling bdrv_co_flush for underlying
@@ -414,14 +611,6 @@ struct BlockDriver {
      */
     int coroutine_fn (*bdrv_co_flush_to_os)(BlockDriverState *bs);
 
-    /*
-     * Drivers setting this field must be able to work with just a plain
-     * filename with '<protocol_name>:' as a prefix, and no other options.
-     * Options may be extracted from the filename by implementing
-     * bdrv_parse_filename.
-     */
-    const char *protocol_name;
-
     /*
      * Truncate @bs to @offset bytes using the given @prealloc mode
      * when growing.  Modes other than PREALLOC_MODE_OFF should be
@@ -443,47 +632,20 @@ struct BlockDriver {
     int64_t (*bdrv_get_allocated_file_size)(BlockDriverState *bs);
     BlockMeasureInfo *(*bdrv_measure)(QemuOpts *opts, BlockDriverState *in_bs,
                                       Error **errp);
-
     int coroutine_fn (*bdrv_co_pwritev_compressed)(BlockDriverState *bs,
         int64_t offset, int64_t bytes, QEMUIOVector *qiov);
     int coroutine_fn (*bdrv_co_pwritev_compressed_part)(BlockDriverState *bs,
         int64_t offset, int64_t bytes, QEMUIOVector *qiov,
         size_t qiov_offset);
 
-    int (*bdrv_snapshot_create)(BlockDriverState *bs,
-                                QEMUSnapshotInfo *sn_info);
-    int (*bdrv_snapshot_goto)(BlockDriverState *bs,
-                              const char *snapshot_id);
-    int (*bdrv_snapshot_delete)(BlockDriverState *bs,
-                                const char *snapshot_id,
-                                const char *name,
-                                Error **errp);
-    int (*bdrv_snapshot_list)(BlockDriverState *bs,
-                              QEMUSnapshotInfo **psn_info);
-    int (*bdrv_snapshot_load_tmp)(BlockDriverState *bs,
-                                  const char *snapshot_id,
-                                  const char *name,
-                                  Error **errp);
     int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
 
     ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
                                                  Error **errp);
     BlockStatsSpecific *(*bdrv_get_specific_stats)(BlockDriverState *bs);
 
-    int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs,
-                                          QEMUIOVector *qiov,
-                                          int64_t pos);
-    int coroutine_fn (*bdrv_load_vmstate)(BlockDriverState *bs,
-                                          QEMUIOVector *qiov,
-                                          int64_t pos);
-
-    int (*bdrv_change_backing_file)(BlockDriverState *bs,
-        const char *backing_file, const char *backing_fmt);
-
     /* removable device specific */
     bool (*bdrv_is_inserted)(BlockDriverState *bs);
-    void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
-    void (*bdrv_lock_medium)(BlockDriverState *bs, bool locked);
 
     /* to control generic scsi devices */
     BlockAIOCB *(*bdrv_aio_ioctl)(BlockDriverState *bs,
@@ -492,21 +654,6 @@ struct BlockDriver {
     int coroutine_fn (*bdrv_co_ioctl)(BlockDriverState *bs,
                                       unsigned long int req, void *buf);
 
-    /* List of options for creating images, terminated by name == NULL */
-    QemuOptsList *create_opts;
-
-    /* List of options for image amend */
-    QemuOptsList *amend_opts;
-
-    /*
-     * If this driver supports reopening images this contains a
-     * NULL-terminated list of the runtime options that can be
-     * modified. If an option in this list is unspecified during
-     * reopen then it _must_ be reset to its default value or return
-     * an error.
-     */
-    const char *const *mutable_opts;
-
     /*
      * Returns 0 for completed check, -errno for internal errors.
      * The check results are stored in result.
@@ -517,58 +664,10 @@ struct BlockDriver {
 
     void (*bdrv_debug_event)(BlockDriverState *bs, BlkdebugEvent event);
 
-    /* TODO Better pass a option string/QDict/QemuOpts to add any rule? */
-    int (*bdrv_debug_breakpoint)(BlockDriverState *bs, const char *event,
-        const char *tag);
-    int (*bdrv_debug_remove_breakpoint)(BlockDriverState *bs,
-        const char *tag);
-    int (*bdrv_debug_resume)(BlockDriverState *bs, const char *tag);
-    bool (*bdrv_debug_is_suspended)(BlockDriverState *bs, const char *tag);
-
-    void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);
-
-    /*
-     * Returns 1 if newly created images are guaranteed to contain only
-     * zeros, 0 otherwise.
-     */
-    int (*bdrv_has_zero_init)(BlockDriverState *bs);
-
-    /*
-     * Remove fd handlers, timers, and other event loop callbacks so the event
-     * loop is no longer in use.  Called with no in-flight requests and in
-     * depth-first traversal order with parents before child nodes.
-     */
-    void (*bdrv_detach_aio_context)(BlockDriverState *bs);
-
-    /*
-     * Add fd handlers, timers, and other event loop callbacks so I/O requests
-     * can be processed again.  Called with no in-flight requests and in
-     * depth-first traversal order with child nodes before parent nodes.
-     */
-    void (*bdrv_attach_aio_context)(BlockDriverState *bs,
-                                    AioContext *new_context);
-
     /* io queue for linux-aio */
     void (*bdrv_io_plug)(BlockDriverState *bs);
     void (*bdrv_io_unplug)(BlockDriverState *bs);
 
-    /**
-     * Try to get @bs's logical and physical block size.
-     * On success, store them in @bsz and return zero.
-     * On failure, return negative errno.
-     */
-    /* I/O API, even though if it's a filter jumps on parent */
-    int (*bdrv_probe_blocksizes)(BlockDriverState *bs, BlockSizes *bsz);
-    /**
-     * Try to get @bs's geometry (cyls, heads, sectors)
-     * On success, store them in @geo and return 0.
-     * On failure return -errno.
-     * Only drivers that want to override guest geometry implement this
-     * callback; see hd_geometry_guess().
-     */
-    /* I/O API, even though if it's a filter jumps on parent */
-    int (*bdrv_probe_geometry)(BlockDriverState *bs, HDGeometry *geo);
-
     /**
      * bdrv_co_drain_begin is called if implemented in the beginning of a
      * drain operation to drain and stop any internal sources of requests in
@@ -582,69 +681,6 @@ struct BlockDriver {
     void coroutine_fn (*bdrv_co_drain_begin)(BlockDriverState *bs);
     void coroutine_fn (*bdrv_co_drain_end)(BlockDriverState *bs);
 
-    void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child,
-                           Error **errp);
-    void (*bdrv_del_child)(BlockDriverState *parent, BdrvChild *child,
-                           Error **errp);
-
-    /**
-     * Informs the block driver that a permission change is intended. The
-     * driver checks whether the change is permissible and may take other
-     * preparations for the change (e.g. get file system locks). This operation
-     * is always followed either by a call to either .bdrv_set_perm or
-     * .bdrv_abort_perm_update.
-     *
-     * Checks whether the requested set of cumulative permissions in @perm
-     * can be granted for accessing @bs and whether no other users are using
-     * permissions other than those given in @shared (both arguments take
-     * BLK_PERM_* bitmasks).
-     *
-     * If both conditions are met, 0 is returned. Otherwise, -errno is returned
-     * and errp is set to an error describing the conflict.
-     */
-    int (*bdrv_check_perm)(BlockDriverState *bs, uint64_t perm,
-                           uint64_t shared, Error **errp);
-
-    /**
-     * Called to inform the driver that the set of cumulative set of used
-     * permissions for @bs has changed to @perm, and the set of sharable
-     * permission to @shared. The driver can use this to propagate changes to
-     * its children (i.e. request permissions only if a parent actually needs
-     * them).
-     *
-     * This function is only invoked after bdrv_check_perm(), so block drivers
-     * may rely on preparations made in their .bdrv_check_perm implementation.
-     */
-    void (*bdrv_set_perm)(BlockDriverState *bs, uint64_t perm, uint64_t shared);
-
-    /*
-     * Called to inform the driver that after a previous bdrv_check_perm()
-     * call, the permission update is not performed and any preparations made
-     * for it (e.g. taken file locks) need to be undone.
-     *
-     * This function can be called even for nodes that never saw a
-     * bdrv_check_perm() call. It is a no-op then.
-     */
-    void (*bdrv_abort_perm_update)(BlockDriverState *bs);
-
-    /**
-     * Returns in @nperm and @nshared the permissions that the driver for @bs
-     * needs on its child @c, based on the cumulative permissions requested by
-     * the parents in @parent_perm and @parent_shared.
-     *
-     * If @c is NULL, return the permissions for attaching a new child for the
-     * given @child_class and @role.
-     *
-     * If @reopen_queue is non-NULL, don't return the currently needed
-     * permissions, but those that will be needed after applying the
-     * @reopen_queue.
-     */
-     void (*bdrv_child_perm)(BlockDriverState *bs, BdrvChild *c,
-                             BdrvChildRole role,
-                             BlockReopenQueue *reopen_queue,
-                             uint64_t parent_perm, uint64_t parent_shared,
-                             uint64_t *nperm, uint64_t *nshared);
-
     bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs);
     bool (*bdrv_co_can_store_new_dirty_bitmap)(BlockDriverState *bs,
                                                const char *name,
@@ -653,26 +689,6 @@ struct BlockDriver {
     int (*bdrv_co_remove_persistent_dirty_bitmap)(BlockDriverState *bs,
                                                   const char *name,
                                                   Error **errp);
-
-    /**
-     * Register/unregister a buffer for I/O. For example, when the driver is
-     * interested to know the memory areas that will later be used in iovs, so
-     * that it can do IOMMU mapping with VFIO etc., in order to get better
-     * performance. In the case of VFIO drivers, this callback is used to do
-     * DMA mapping for hot buffers.
-     */
-    void (*bdrv_register_buf)(BlockDriverState *bs, void *host, size_t size);
-    void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host);
-    QLIST_ENTRY(BlockDriver) list;
-
-    /*
-     * Pointer to a NULL-terminated array of names of strong options
-     * that can be specified for bdrv_open(). A strong option is one
-     * that changes the data of a BDS.
-     * If this pointer is NULL, the array is considered empty.
-     * "filename" and "driver" are always considered strong.
-     */
-    const char *const *strong_runtime_opts;
 };
 
 static inline bool block_driver_can_compress(BlockDriver *drv)
-- 
2.27.0




* [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (18 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 19/25] block_int-common.h: split function pointers in BlockDriver Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-15 12:48   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 21/25] block_int-common.h: split function pointers in BdrvChildClass Emanuele Giuseppe Esposito
                   ` (7 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
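The pattern this patch applies to the callers can be sketched in isolation
as follows. This is only an illustration under stated assumptions, not QEMU
code: mark_main_thread(), in_main_thread(), Driver, driver_inactivate() and
count_inactivate() are hypothetical names, and the thread-local flag is an
assumed stand-in for qemu_in_main_thread() from patch 01.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for qemu_in_main_thread(): a thread-local flag
 * that is true only in the thread that called mark_main_thread(). */
static _Thread_local bool is_main_thread;

void mark_main_thread(void)
{
    is_main_thread = true;
}

bool in_main_thread(void)
{
    return is_main_thread;
}

/* Hypothetical driver with one optional global-state callback. */
typedef struct Driver {
    int (*inactivate)(void *state);
} Driver;

/* Caller-side wrapper: check the threading rule first, then dispatch
 * through the function pointer if the driver implements it. */
int driver_inactivate(const Driver *drv, void *state)
{
    assert(in_main_thread());
    if (!drv || !drv->inactivate) {
        return 0; /* optional callback not implemented: nothing to do */
    }
    return drv->inactivate(state);
}

/* Sample callback used for illustration: counts how often it ran. */
int count_inactivate(void *state)
{
    ++*(int *)state;
    return 0;
}
```

Placing the assertion in the wrapper rather than in each driver keeps the
check in one place per callback, so a call from the wrong thread aborts
before any driver code runs.
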
 block.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/block.c b/block.c
index 94bff5c757..40c4729b8d 100644
--- a/block.c
+++ b/block.c
@@ -1074,6 +1074,7 @@ int refresh_total_sectors(BlockDriverState *bs, int64_t hint)
 static void bdrv_join_options(BlockDriverState *bs, QDict *options,
                               QDict *old_options)
 {
+    assert(qemu_in_main_thread());
     if (bs->drv && bs->drv->bdrv_join_options) {
         bs->drv->bdrv_join_options(options, old_options);
     } else {
@@ -1566,6 +1567,7 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
 {
     Error *local_err = NULL;
     int i, ret;
+    assert(qemu_in_main_thread());
 
     bdrv_assign_node_name(bs, node_name, &local_err);
     if (local_err) {
@@ -1955,6 +1957,8 @@ static int bdrv_fill_options(QDict **options, const char *filename,
     BlockDriver *drv = NULL;
     Error *local_err = NULL;
 
+    assert(qemu_in_main_thread());
+
     /*
      * Caution: while qdict_get_try_str() is fine, getting non-string
      * types would require more care.  When @options come from
@@ -2148,6 +2152,7 @@ static void bdrv_child_perm(BlockDriverState *bs, BlockDriverState *child_bs,
                             uint64_t *nperm, uint64_t *nshared)
 {
     assert(bs->drv && bs->drv->bdrv_child_perm);
+    assert(qemu_in_main_thread());
     bs->drv->bdrv_child_perm(bs, c, role, reopen_queue,
                              parent_perm, parent_shared,
                              nperm, nshared);
@@ -2231,6 +2236,7 @@ static void bdrv_drv_set_perm_commit(void *opaque)
 {
     BlockDriverState *bs = opaque;
     uint64_t cumulative_perms, cumulative_shared_perms;
+    assert(qemu_in_main_thread());
 
     if (bs->drv->bdrv_set_perm) {
         bdrv_get_cumulative_perm(bs, &cumulative_perms,
@@ -2242,6 +2248,7 @@ static void bdrv_drv_set_perm_commit(void *opaque)
 static void bdrv_drv_set_perm_abort(void *opaque)
 {
     BlockDriverState *bs = opaque;
+    assert(qemu_in_main_thread());
 
     if (bs->drv->bdrv_abort_perm_update) {
         bs->drv->bdrv_abort_perm_update(bs);
@@ -2257,6 +2264,7 @@ static int bdrv_drv_set_perm(BlockDriverState *bs, uint64_t perm,
                              uint64_t shared_perm, Transaction *tran,
                              Error **errp)
 {
+    assert(qemu_in_main_thread());
     if (!bs->drv) {
         return 0;
     }
@@ -4221,6 +4229,7 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp)
 
     assert(qemu_get_current_aio_context() == qemu_get_aio_context());
     assert(bs_queue != NULL);
+    assert(qemu_in_main_thread());
 
     QTAILQ_FOREACH(bs_entry, bs_queue, entry) {
         ctx = bdrv_get_aio_context(bs_entry->state.bs);
@@ -4484,6 +4493,7 @@ static int bdrv_reopen_prepare(BDRVReopenState *reopen_state,
 
     assert(reopen_state != NULL);
     assert(reopen_state->bs->drv != NULL);
+    assert(qemu_in_main_thread());
     drv = reopen_state->bs->drv;
 
     /* This function and each driver's bdrv_reopen_prepare() remove
@@ -4694,6 +4704,7 @@ static void bdrv_reopen_commit(BDRVReopenState *reopen_state)
     bs = reopen_state->bs;
     drv = bs->drv;
     assert(drv != NULL);
+    assert(qemu_in_main_thread());
 
     /* If there are any driver level actions to take */
     if (drv->bdrv_reopen_commit) {
@@ -4735,6 +4746,7 @@ static void bdrv_reopen_abort(BDRVReopenState *reopen_state)
     assert(reopen_state != NULL);
     drv = reopen_state->bs->drv;
     assert(drv != NULL);
+    assert(qemu_in_main_thread());
 
     if (drv->bdrv_reopen_abort) {
         drv->bdrv_reopen_abort(reopen_state);
@@ -4748,6 +4760,7 @@ static void bdrv_close(BlockDriverState *bs)
     BdrvChild *child, *next;
 
     assert(!bs->refcnt);
+    assert(qemu_in_main_thread());
 
     bdrv_drained_begin(bs); /* complete I/O */
     bdrv_flush(bs);
@@ -6499,6 +6512,8 @@ static int bdrv_inactivate_recurse(BlockDriverState *bs)
     int ret;
     uint64_t cumulative_perms, cumulative_shared_perms;
 
+    assert(qemu_in_main_thread());
+
     if (!bs->drv) {
         return -ENOMEDIUM;
     }
@@ -7007,6 +7022,7 @@ static void bdrv_detach_aio_context(BlockDriverState *bs)
     BdrvAioNotifier *baf, *baf_tmp;
 
     assert(!bs->walking_aio_notifiers);
+    assert(qemu_in_main_thread());
     bs->walking_aio_notifiers = true;
     QLIST_FOREACH_SAFE(baf, &bs->aio_notifiers, list, baf_tmp) {
         if (baf->deleted) {
@@ -7034,6 +7050,7 @@ static void bdrv_attach_aio_context(BlockDriverState *bs,
                                     AioContext *new_context)
 {
     BdrvAioNotifier *ban, *ban_tmp;
+    assert(qemu_in_main_thread());
 
     if (bs->quiesce_counter) {
         aio_disable_external(new_context);
-- 
2.27.0




* [PATCH v4 21/25] block_int-common.h: split function pointers in BdrvChildClass
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (19 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-15 14:36   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 22/25] block_int-common.h: assertions in the callers of BdrvChildClass function pointers Emanuele Giuseppe Esposito
                   ` (6 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
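A reduced sketch of the grouping done here, assuming a hypothetical
ChildClass with only three callbacks. None of the names below exist in
QEMU; the thread-local flag merely stands in for the main-thread check,
and the wrappers only illustrate how the two groups would be called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the main-thread check used by the series. */
static _Thread_local bool is_main_thread;

typedef struct Child Child;

typedef struct ChildClass {
    /* Global state (GS) callbacks: called from the main thread only. */
    void (*attach)(Child *c);
    void (*detach)(Child *c);

    /* I/O callbacks: thread-safe, callable from any thread. */
    void (*resize)(Child *c);
} ChildClass;

struct Child {
    const ChildClass *klass;
    int attached;   /* number of attach notifications seen */
    int resizes;    /* number of resize notifications seen */
};

/* GS wrapper: asserts the main thread before dispatching. */
void child_attach(Child *c)
{
    assert(is_main_thread);
    if (c->klass->attach) {
        c->klass->attach(c);
    }
}

/* I/O wrapper: no thread assertion, only the optional dispatch. */
void child_resize(Child *c)
{
    if (c->klass->resize) {
        c->klass->resize(c);
    }
}

/* Sample callbacks used for illustration. */
void note_attach(Child *c) { c->attached++; }
void note_resize(Child *c) { c->resizes++; }
```

Keeping each group contiguous in the struct lets a reader tell from the
declaration alone which wrappers need the main-thread assertion.
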
 include/block/block_int-common.h | 51 ++++++++++++++++++++------------
 1 file changed, 32 insertions(+), 19 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 9857e775fe..ea16099c53 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -805,12 +805,16 @@ struct BdrvChildClass {
      */
     bool parent_is_bds;
 
+    /*
+     * Global state (GS) API. These functions run under the BQL lock.
+     *
+     * See include/block/block-global-state.h for more information about
+     * the GS API.
+     */
     void (*inherit_options)(BdrvChildRole role, bool parent_is_format,
                             int *child_flags, QDict *child_options,
                             int parent_flags, QDict *parent_options);
-
     void (*change_media)(BdrvChild *child, bool load);
-    void (*resize)(BdrvChild *child);
 
     /*
      * Returns a name that is supposedly more useful for human users than the
@@ -827,6 +831,32 @@ struct BdrvChildClass {
      */
     char *(*get_parent_desc)(BdrvChild *child);
 
+    void (*attach)(BdrvChild *child);
+    void (*detach)(BdrvChild *child);
+
+    /*
+     * Notifies the parent that the filename of its child has changed (e.g.
+     * because the direct child was removed from the backing chain), so that it
+     * can update its reference.
+     */
+    int (*update_filename)(BdrvChild *child, BlockDriverState *new_base,
+                           const char *filename, Error **errp);
+
+    bool (*can_set_aio_ctx)(BdrvChild *child, AioContext *ctx,
+                        GSList **ignore, Error **errp);
+    void (*set_aio_ctx)(BdrvChild *child, AioContext *ctx, GSList **ignore);
+
+    AioContext *(*get_parent_aio_context)(BdrvChild *child);
+
+    /*
+     * I/O API functions. These functions are thread-safe.
+     *
+     * See include/block/block-io.h for more information about
+     * the I/O API.
+     */
+
+    void (*resize)(BdrvChild *child);
+
     /*
      * If this pair of functions is implemented, the parent doesn't issue new
      * requests after returning from .drained_begin() until .drained_end() is
@@ -859,23 +889,6 @@ struct BdrvChildClass {
      */
     void (*activate)(BdrvChild *child, Error **errp);
     int (*inactivate)(BdrvChild *child);
-
-    void (*attach)(BdrvChild *child);
-    void (*detach)(BdrvChild *child);
-
-    /*
-     * Notifies the parent that the filename of its child has changed (e.g.
-     * because the direct child was removed from the backing chain), so that it
-     * can update its reference.
-     */
-    int (*update_filename)(BdrvChild *child, BlockDriverState *new_base,
-                           const char *filename, Error **errp);
-
-    bool (*can_set_aio_ctx)(BdrvChild *child, AioContext *ctx,
-                            GSList **ignore, Error **errp);
-    void (*set_aio_ctx)(BdrvChild *child, AioContext *ctx, GSList **ignore);
-
-    AioContext *(*get_parent_aio_context)(BdrvChild *child);
 };
 
 extern const BdrvChildClass child_of_bds;
-- 
2.27.0




* [PATCH v4 22/25] block_int-common.h: assertions in the callers of BdrvChildClass function pointers
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (20 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 21/25] block_int-common.h: split function pointers in BdrvChildClass Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-15 14:48   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 23/25] block-backend-common.h: split function pointers in BlockDevOps Emanuele Giuseppe Esposito
                   ` (5 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/block.c b/block.c
index 40c4729b8d..da80e89ad4 100644
--- a/block.c
+++ b/block.c
@@ -1462,6 +1462,7 @@ const BdrvChildClass child_of_bds = {
 
 AioContext *bdrv_child_get_parent_aio_context(BdrvChild *c)
 {
+    assert(qemu_in_main_thread());
     return c->klass->get_parent_aio_context(c);
 }
 
@@ -2085,6 +2086,7 @@ bool bdrv_is_writable(BlockDriverState *bs)
 
 static char *bdrv_child_user_desc(BdrvChild *c)
 {
+    assert(qemu_in_main_thread());
     return c->klass->get_parent_desc(c);
 }
 
@@ -2718,6 +2720,7 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
     int drain_saldo;
 
     assert(!child->frozen);
+    assert(qemu_in_main_thread());
 
     if (old_bs && new_bs) {
         assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
@@ -2806,6 +2809,8 @@ static void bdrv_attach_child_common_abort(void *opaque)
     BdrvChild *child = *s->child;
     BlockDriverState *bs = child->bs;
 
+    assert(qemu_in_main_thread());
+
     bdrv_replace_child_noperm(child, NULL);
 
     if (bdrv_get_aio_context(bs) != s->old_child_ctx) {
@@ -3164,6 +3169,7 @@ void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child)
 static void bdrv_parent_cb_change_media(BlockDriverState *bs, bool load)
 {
     BdrvChild *c;
+    assert(qemu_in_main_thread());
     QLIST_FOREACH(c, &bs->parents, next_parent) {
         if (c->klass->change_media) {
             c->klass->change_media(c, load);
@@ -3655,6 +3661,7 @@ static BlockDriverState *bdrv_open_inherit(const char *filename,
 
     assert(!child_class || !flags);
     assert(!child_class == !parent);
+    assert(qemu_in_main_thread());
 
     if (reference) {
         bool options_non_empty = options ? qdict_size(options) : false;
@@ -4041,6 +4048,7 @@ static BlockReopenQueue *bdrv_reopen_queue_child(BlockReopenQueue *bs_queue,
      * important to avoid graph changes between the recursive queuing here and
      * bdrv_reopen_multiple(). */
     assert(bs->quiesce_counter > 0);
+    assert(qemu_in_main_thread());
 
     if (bs_queue == NULL) {
         bs_queue = g_new0(BlockReopenQueue, 1);
@@ -7097,6 +7105,7 @@ void bdrv_set_aio_context_ignore(BlockDriverState *bs,
     BdrvChild *child, *parent;
 
     g_assert(qemu_get_current_aio_context() == qemu_get_aio_context());
+    assert(qemu_in_main_thread());
 
     if (old_context == new_context) {
         return;
@@ -7173,6 +7182,7 @@ static bool bdrv_parent_can_set_aio_context(BdrvChild *c, AioContext *ctx,
         return true;
     }
     *ignore = g_slist_prepend(*ignore, c);
+    assert(qemu_in_main_thread());
 
     /*
      * A BdrvChildClass that doesn't handle AioContext changes cannot
-- 
2.27.0




* [PATCH v4 23/25] block-backend-common.h: split function pointers in BlockDevOps
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (21 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 22/25] block_int-common.h: assertions in the callers of BdrvChildClass function pointers Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-10-25 14:10   ` Philippe Mathieu-Daudé
  2021-10-25 10:17 ` [PATCH v4 24/25] job.h: split function pointers in JobDriver Emanuele Giuseppe Esposito
                   ` (4 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Assertions in the callers of the function pointers have already
been added by previous patches.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/sysemu/block-backend-common.h | 28 ++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/sysemu/block-backend-common.h b/include/sysemu/block-backend-common.h
index 52ff6a4d26..9ffd098458 100644
--- a/include/sysemu/block-backend-common.h
+++ b/include/sysemu/block-backend-common.h
@@ -17,6 +17,14 @@
 
 /* Callbacks for block device models */
 typedef struct BlockDevOps {
+
+    /*
+     * Global state (GS) API. These functions run under the BQL lock.
+     *
+     * See include/block/block-global-state.h for more information about
+     * the GS API.
+     */
+
     /*
      * Runs when virtual media changed (monitor commands eject, change)
      * Argument load is true on load and false on eject.
@@ -34,16 +42,26 @@ typedef struct BlockDevOps {
      * true, even if they do not support eject requests.
      */
     void (*eject_request_cb)(void *opaque, bool force);
-    /*
-     * Is the virtual tray open?
-     * Device models implement this only when the device has a tray.
-     */
-    bool (*is_tray_open)(void *opaque);
+
     /*
      * Is the virtual medium locked into the device?
      * Device models implement this only when device has such a lock.
      */
     bool (*is_medium_locked)(void *opaque);
+
+    /*
+     * I/O API functions. These functions are thread-safe.
+     *
+     * See include/block/block-io.h for more information about
+     * the I/O API.
+     */
+
+    /*
+     * Is the virtual tray open?
+     * Device models implement this only when the device has a tray.
+     */
+    bool (*is_tray_open)(void *opaque);
+
     /*
      * Runs when the size changed (e.g. monitor command block_resize)
      */
-- 
2.27.0




* [PATCH v4 24/25] job.h: split function pointers in JobDriver
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (22 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 23/25] block-backend-common.h: split function pointers in BlockDevOps Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-11-15 15:11   ` Hanna Reitz
  2021-10-25 10:17 ` [PATCH v4 25/25] job.h: assertions in the callers of JobDriver function pointers Emanuele Giuseppe Esposito
                   ` (3 subsequent siblings)
  27 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

The job API will be handled separately in another series.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/qemu/job.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 6e67b6977f..7e9e59f4b8 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -169,12 +169,21 @@ typedef struct Job {
  * Callbacks and other information about a Job driver.
  */
 struct JobDriver {
+
+    /* Fields initialized in struct definition and never changed. */
+
     /** Derived Job struct size */
     size_t instance_size;
 
     /** Enum describing the operation */
     JobType job_type;
 
+    /*
+     * Functions run without regard to the BQL and may run in any
+     * arbitrary thread. These functions do not need to be thread-safe
+     * because the caller ensures they are invoked from one thread at a time.
+     */
+
     /**
      * Mandatory: Entrypoint for the Coroutine.
      *
@@ -201,6 +210,13 @@ struct JobDriver {
      */
     void coroutine_fn (*resume)(Job *job);
 
+    /*
+     * Global state (GS) API. These functions run under the BQL lock.
+     *
+     * See include/block/block-global-state.h for more information about
+     * the GS API.
+     */
+
     /**
      * Called when the job is resumed by the user (i.e. user_paused becomes
      * false). .user_resume is called before .resume.
-- 
2.27.0




* [PATCH v4 25/25] job.h: assertions in the callers of JobDriver function pointers
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (23 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 24/25] job.h: split function pointers in JobDriver Emanuele Giuseppe Esposito
@ 2021-10-25 10:17 ` Emanuele Giuseppe Esposito
  2021-10-25 14:09 ` [PATCH v4 00/25] block layer: split block APIs in global state and I/O Philippe Mathieu-Daudé
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 10:17 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Emanuele Giuseppe Esposito, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, Eric Blake

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 job.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/job.c b/job.c
index dbfa67bb0a..94b142684f 100644
--- a/job.c
+++ b/job.c
@@ -380,6 +380,8 @@ void job_ref(Job *job)
 
 void job_unref(Job *job)
 {
+    assert(qemu_in_main_thread());
+
     if (--job->refcnt == 0) {
         assert(job->status == JOB_STATUS_NULL);
         assert(!timer_pending(&job->sleep_timer));
@@ -601,6 +603,7 @@ bool job_user_paused(Job *job)
 void job_user_resume(Job *job, Error **errp)
 {
     assert(job);
+    assert(qemu_in_main_thread());
     if (!job->user_paused || job->pause_count <= 0) {
         error_setg(errp, "Can't resume a job that was not paused");
         return;
@@ -671,6 +674,7 @@ static void job_update_rc(Job *job)
 static void job_commit(Job *job)
 {
     assert(!job->ret);
+    assert(qemu_in_main_thread());
     if (job->driver->commit) {
         job->driver->commit(job);
     }
@@ -679,6 +683,7 @@ static void job_commit(Job *job)
 static void job_abort(Job *job)
 {
     assert(job->ret);
+    assert(qemu_in_main_thread());
     if (job->driver->abort) {
         job->driver->abort(job);
     }
@@ -686,6 +691,7 @@ static void job_abort(Job *job)
 
 static void job_clean(Job *job)
 {
+    assert(qemu_in_main_thread());
     if (job->driver->clean) {
         job->driver->clean(job);
     }
@@ -725,6 +731,7 @@ static int job_finalize_single(Job *job)
 
 static void job_cancel_async(Job *job, bool force)
 {
+    assert(qemu_in_main_thread());
     if (job->driver->cancel) {
         force = job->driver->cancel(job, force);
     } else {
@@ -824,6 +831,7 @@ static void job_completed_txn_abort(Job *job)
 
 static int job_prepare(Job *job)
 {
+    assert(qemu_in_main_thread());
     if (job->ret == 0 && job->driver->prepare) {
         job->ret = job->driver->prepare(job);
         job_update_rc(job);
@@ -1053,6 +1061,7 @@ void job_complete(Job *job, Error **errp)
 {
     /* Should not be reachable via external interface for internal jobs */
     assert(job->id);
+    assert(qemu_in_main_thread());
     if (job_apply_verb(job, JOB_VERB_COMPLETE, errp)) {
         return;
     }
-- 
2.27.0




* Re: [PATCH v4 01/25] main-loop.h: introduce qemu_in_main_thread()
  2021-10-25 10:17 ` [PATCH v4 01/25] main-loop.h: introduce qemu_in_main_thread() Emanuele Giuseppe Esposito
@ 2021-10-25 11:33   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 86+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-10-25 11:33 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, Eric Blake, Richard Henderson,
	qemu-devel, Markus Armbruster, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, John Snow, Dr. David Alan Gilbert

On 10/25/21 12:17, Emanuele Giuseppe Esposito wrote:
> When invoked from the main loop, this function is the same
> as qemu_mutex_iothread_locked, and returns true if the BQL is held.
> When invoked from iothreads or tests, it returns true only
> if the current AioContext is the Main Loop.
> 
> This essentially just extends qemu_mutex_iothread_locked to work
> also in unit tests or other users like storage-daemon, that run
> in the Main Loop but end up using the implementation in
> stubs/iothread-lock.c.
> 
> Using qemu_mutex_iothread_locked in unit tests defaults to false
> because they use the implementation in stubs/iothread-lock,
> making all assertions added in next patches fail despite the

"in the following commits"?

> AioContext is still the main loop.
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  include/qemu/main-loop.h | 13 +++++++++++++
>  softmmu/cpus.c           |  5 +++++
>  stubs/iothread-lock.c    |  5 +++++
>  3 files changed, 23 insertions(+)

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
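
For readers following along, the two behaviours described in the quoted
commit message can be modelled as below. This is only a self-contained
sketch: the patch bodies are not shown in this mail, and every name
prefixed with "model_" or "Model" is hypothetical and does not exist in
QEMU. It assumes the full emulator build answers "is the BQL held?",
while stub users (unit tests, tools) answer "is the current AioContext
the main loop's one?".

```c
#include <assert.h>
#include <stdbool.h>

/* Model of an AioContext; only its identity (address) matters here. */
typedef struct ModelAioContext {
    int unused;
} ModelAioContext;

static ModelAioContext model_main_ctx;     /* the main loop's context */
static ModelAioContext model_iothread_ctx; /* some iothread's context */

/*
 * Hypothetical model of the full emulator build: the check is
 * equivalent to asking whether the BQL is currently held.
 */
static bool model_in_main_thread_softmmu(bool bql_held)
{
    return bql_held;
}

/*
 * Hypothetical model of the stub build used by unit tests and tools:
 * there is no real BQL, so compare the current context with the main
 * loop's context instead.
 */
static bool model_in_main_thread_stub(const ModelAioContext *current)
{
    return current == &model_main_ctx;
}
```

With a stub that always reported "BQL not held", an assertion in a unit
test running in the main loop would fail; the context comparison in the
second variant is what avoids that.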




* Re: [PATCH v4 02/25] include/block/block: split header into I/O and global state API
  2021-10-25 10:17 ` [PATCH v4 02/25] include/block/block: split header into I/O and global state API Emanuele Giuseppe Esposito
@ 2021-10-25 11:37   ` Philippe Mathieu-Daudé
  2021-10-25 12:22     ` Emanuele Giuseppe Esposito
  2021-11-11 15:00   ` Hanna Reitz
  2021-11-12 12:25   ` Hanna Reitz
  2 siblings, 1 reply; 86+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-10-25 11:37 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, Eric Blake, Richard Henderson,
	qemu-devel, Markus Armbruster, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, John Snow, Dr. David Alan Gilbert

On 10/25/21 12:17, Emanuele Giuseppe Esposito wrote:
> block.h currently contains a mix of functions:
> some of them run under the BQL and modify the block layer graph,
> others are instead thread-safe and perform I/O in iothreads.
> It is not easy to understand which function is part of which
> group (I/O vs GS), and this patch aims to clarify it.
> 
> The "GS" functions need the BQL, and often use
> aio_context_acquire/release and/or drain to be sure they
> can modify the graph safely.
> The I/O functions are instead thread safe, and can run in
> any AioContext.
> 
> By splitting the header in two files, block-io.h
> and block-global-state.h we have a clearer view on what
> needs what kind of protection. block-common.h
> contains common structures shared by both headers.
> 
> block.h is left there for legacy and to avoid changing
> all includes in all c files that use the block APIs.
> 
> Assertions are added in the next patch.
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  block.c                            |   3 +
>  block/meson.build                  |   7 +-
>  include/block/block-common.h       | 389 +++++++++++++

Can this patch be split in 3?

(first)

>  include/block/block-global-state.h | 286 ++++++++++

(second)

>  include/block/block-io.h           | 306 ++++++++++

(third)

>  include/block/block.h              | 878 +----------------------------
>  6 files changed, 1012 insertions(+), 857 deletions(-)
>  create mode 100644 include/block/block-common.h
>  create mode 100644 include/block/block-global-state.h
>  create mode 100644 include/block/block-io.h

Also consider enabling scripts/git.orderfile to ease patch review.




* Re: [PATCH v4 02/25] include/block/block: split header into I/O and global state API
  2021-10-25 11:37   ` Philippe Mathieu-Daudé
@ 2021-10-25 12:22     ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-10-25 12:22 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, Eric Blake, Richard Henderson,
	qemu-devel, Markus Armbruster, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, John Snow, Dr. David Alan Gilbert


>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   block.c                            |   3 +
>>   block/meson.build                  |   7 +-
>>   include/block/block-common.h       | 389 +++++++++++++
> 
> Can this patch be split in 3?
> 
> (first)
> 
>>   include/block/block-global-state.h | 286 ++++++++++
> 
> (second)
> 
>>   include/block/block-io.h           | 306 ++++++++++
> 
> (third)

I think it is a good idea especially for future patches, since it 
improves readability. For this series I think it has already been fully 
reviewed, so it won't matter too much. But I will follow this logic for 
the upcoming job patches, thanks.

> 
>>   include/block/block.h              | 878 +----------------------------
>>   6 files changed, 1012 insertions(+), 857 deletions(-)
>>   create mode 100644 include/block/block-common.h
>>   create mode 100644 include/block/block-global-state.h
>>   create mode 100644 include/block/block-io.h
> 
> Also consider enabling scripts/git.orderfile to ease patch review.
> 

Done, thanks for pointing it out.

Emanuele




* Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (24 preceding siblings ...)
  2021-10-25 10:17 ` [PATCH v4 25/25] job.h: assertions in the callers of JobDriver function pointers Emanuele Giuseppe Esposito
@ 2021-10-25 14:09 ` Philippe Mathieu-Daudé
  2021-10-28 15:45   ` Stefan Hajnoczi
  2021-10-28 15:49 ` Stefan Hajnoczi
  2021-11-15 16:03 ` Hanna Reitz
  27 siblings, 1 reply; 86+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-10-25 14:09 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, Eric Blake, Richard Henderson,
	qemu-devel, Markus Armbruster, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, John Snow, Dr. David Alan Gilbert

On 10/25/21 12:17, Emanuele Giuseppe Esposito wrote:
[...]

> Each function in the GS API will have an assertion, checking
> that it is always running under BQL.
> I/O functions are instead thread safe (or so should be), meaning
> that they *can* run under BQL, but also in an iothread in another
> AioContext. Therefore they do not provide any assertion, and
> need to be audited manually to verify the correctness.
> 
> Adding assertions has helped find 2 bugs already, as shown in
> my series "Migration: fix missing iothread locking".
> 
> Tested this series by running unit tests, qemu-iotests and qtests
> (x86_64).
> Some functions in the GS API are used everywhere but not
> properly tested. Therefore their assertion is never actually run in
> the tests, so despite my very careful auditing, it is not possible
> to exclude that some will trigger while actually using QEMU.
> 
> Patch 1 introduces qemu_in_main_thread(), the function used in
> all assertions. This had to be introduced, otherwise all unit tests
> would fail, since they run in the main loop but use the code in
> stubs/iothread-lock.c.
> Patches 2-14 and 19-25 (with the exception of patch 9, that is an additional
> assert) are all structured in the same way: first we split the header
> and in the next (even) patch we add assertions.
> The rest of the patches contain either both assertions and split,
> or have no assertions.

This seems like a lot of assertions added in hot-path code.

Does it make sense to use a BLOCK_ASSERT() macro instead,
only expanded when configured with --enable-debug?
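
To make the suggestion concrete, here is a minimal sketch of what such
a macro could look like. Both BLOCK_ASSERT and CONFIG_BLOCK_DEBUG are
hypothetical names (neither exists in QEMU as far as this thread
shows), and in_main_thread_model() is only a stand-in for
qemu_in_main_thread().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical: CONFIG_BLOCK_DEBUG stands for whatever symbol a debug
 * build (for example one configured with --enable-debug) would define.
 */
#ifdef CONFIG_BLOCK_DEBUG
#define BLOCK_ASSERT(cond) assert(cond)
#else
/* Keep the expression type-checked, but never evaluate it. */
#define BLOCK_ASSERT(cond) ((void)sizeof((cond) ? 1 : 0))
#endif

/* Counts how often the check is actually evaluated at run time. */
static int check_calls;

/* Stand-in for qemu_in_main_thread() in this sketch. */
static bool in_main_thread_model(void)
{
    check_calls++;
    return true;
}

/* Example GS-style function guarded by the debug-only assertion. */
static int gs_function_model(int x)
{
    BLOCK_ASSERT(in_main_thread_model());
    return x + 1;
}
```

In a build without the debug symbol the check costs nothing at run
time, at the price of losing the assertion in production binaries.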




* Re: [PATCH v4 23/25] block-backend-common.h: split function pointers in BlockDevOps
  2021-10-25 10:17 ` [PATCH v4 23/25] block-backend-common.h: split function pointers in BlockDevOps Emanuele Giuseppe Esposito
@ 2021-10-25 14:10   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 86+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-10-25 14:10 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, Eric Blake, Richard Henderson,
	qemu-devel, Markus Armbruster, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, John Snow, Dr. David Alan Gilbert

On 10/25/21 12:17, Emanuele Giuseppe Esposito wrote:
> Assertions in the callers of the funciton pointrs are already
> added by previous patches.

Typo "function pointers".

> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  include/sysemu/block-backend-common.h | 28 ++++++++++++++++++++++-----
>  1 file changed, 23 insertions(+), 5 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>




* Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O
  2021-10-25 14:09 ` [PATCH v4 00/25] block layer: split block APIs in global state and I/O Philippe Mathieu-Daudé
@ 2021-10-28 15:45   ` Stefan Hajnoczi
  0 siblings, 0 replies; 86+ messages in thread
From: Stefan Hajnoczi @ 2021-10-28 15:45 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Emanuele Giuseppe Esposito, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy, Daniel P. Berrangé,
	Eduardo Habkost, qemu-block, Juan Quintela, Eric Blake,
	Richard Henderson, qemu-devel, Markus Armbruster, Hanna Reitz,
	Paolo Bonzini, Fam Zheng, John Snow, Dr. David Alan Gilbert


On Mon, Oct 25, 2021 at 04:09:41PM +0200, Philippe Mathieu-Daudé wrote:
> On 10/25/21 12:17, Emanuele Giuseppe Esposito wrote:
> [...]
> 
> > Each function in the GS API will have an assertion, checking
> > that it is always running under BQL.
> > I/O functions are instead thread safe (or so should be), meaning
> > that they *can* run under BQL, but also in an iothread in another
> > AioContext. Therefore they do not provide any assertion, and
> > need to be audited manually to verify the correctness.
> > 
> > Adding assertions has helped find 2 bugs already, as shown in
> > my series "Migration: fix missing iothread locking".
> > 
> > Tested this series by running unit tests, qemu-iotests and qtests
> > (x86_64).
> > Some functions in the GS API are used everywhere but not
> > properly tested. Therefore their assertion is never actually run in
> > the tests, so despite my very careful auditing, it is not possible
> > to exclude that some will trigger while actually using QEMU.
> > 
> > Patch 1 introduces qemu_in_main_thread(), the function used in
> > all assertions. This had to be introduced, otherwise all unit tests
> > would fail, since they run in the main loop but use the code in
> > stubs/iothread-lock.c.
> > Patches 2-14 and 19-25 (with the exception of patch 9, that is an additional
> > assert) are all structured in the same way: first we split the header
> > and in the next (even) patch we add assertions.
> > The rest of the patches contain either both assertions and split,
> > or have no assertions.
> 
> This seems like a lot of assertions added in hot-path code.
> 
> Does it make sense to use a BLOCK_ASSERT() macro instead,
> only expanded when configured with --enable-debug?

I think the assertions are only in the slow path (functions that must be
run with the BQL held from the main thread). The I/O request code path
does not have new assertions.
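
A small model of that split, as a sketch only: the "model_" names and
the ModelNode type are hypothetical and not QEMU code. The graph-changing
(GS-style) function carries the assertion, while the per-request
(I/O-style) function has none and relies on being thread-safe by itself,
here via a C11 atomic counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Model only: stands in for the result of qemu_in_main_thread(). */
static bool model_in_main_thread = true;

typedef struct ModelNode {
    size_t n_children;  /* graph state, changed only by GS-style code */
    atomic_ulong reads; /* request counter, updated from any thread */
} ModelNode;

/* GS-style slow path: rare graph change, so the check is cheap. */
static void model_attach_child(ModelNode *node)
{
    assert(model_in_main_thread);
    node->n_children++;
}

/* I/O-style hot path: runs per request, carries no assertion. */
static void model_read(ModelNode *node)
{
    atomic_fetch_add(&node->reads, 1);
}
```

Only the first kind of function pays for the check, which is why the
request path should not see a measurable cost from this series.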

Stefan



* Re: [PATCH v4 14/25] include/sysemu/blockdev.h: global state API
  2021-10-25 10:17 ` [PATCH v4 14/25] include/sysemu/blockdev.h: global state API Emanuele Giuseppe Esposito
@ 2021-10-28 15:48   ` Stefan Hajnoczi
  2021-11-12 15:46   ` Hanna Reitz
  1 sibling, 0 replies; 86+ messages in thread
From: Stefan Hajnoczi @ 2021-10-28 15:48 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, qemu-block, Juan Quintela, qemu-devel,
	John Snow, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Hanna Reitz, Paolo Bonzini, Eric Blake


On Mon, Oct 25, 2021 at 06:17:24AM -0400, Emanuele Giuseppe Esposito wrote:
> blockdev functions always run under the BQL.
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>  include/sysemu/blockdev.h | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>



* Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (25 preceding siblings ...)
  2021-10-25 14:09 ` [PATCH v4 00/25] block layer: split block APIs in global state and I/O Philippe Mathieu-Daudé
@ 2021-10-28 15:49 ` Stefan Hajnoczi
  2021-11-15 16:03 ` Hanna Reitz
  27 siblings, 0 replies; 86+ messages in thread
From: Stefan Hajnoczi @ 2021-10-28 15:49 UTC (permalink / raw)
  To: Kevin Wolf, Hanna Reitz
  Cc: Fam Zheng, Emanuele Giuseppe Esposito,
	Vladimir Sementsov-Ogievskiy, Daniel P. Berrangé,
	Eduardo Habkost, qemu-block, Juan Quintela, qemu-devel,
	John Snow, Richard Henderson, Markus Armbruster,
	Dr. David Alan Gilbert, Paolo Bonzini, Eric Blake


On Mon, Oct 25, 2021 at 06:17:10AM -0400, Emanuele Giuseppe Esposito wrote:
> Currently, block layer APIs like block-backend.h contain a mix of
> functions that are either running in the main loop and under the
> BQL, or are thread-safe functions and run in iothreads performing I/O.
> The functions running under BQL also take care of modifying the
> block graph, by using drain and/or aio_context_acquire/release.
> This makes it very confusing to understand where each function
> runs, and what assumptions it provides with regard to thread
> safety.
> 
> We call the functions running under BQL "global state (GS) API", and
> distinguish them from the thread-safe "I/O API".
> 
> The aim of this series is to split the relevant block headers in
> global state and I/O sub-headers. The division will be done in

Kevin and Hanna,
Does one of you want to review and merge this? It affects the entire
block layer and your input would be valuable.

Thanks,
Stefan

> this way:
> header.h will be split in header-global-state.h, header-io.h and
> header-common.h. The latter will just contain the data structures
> needed by header-global-state and header-io, and common helpers
> that are neither in GS nor in I/O. header.h will remain for
> legacy and to avoid changing all includes in all QEMU c files,
> but will only include the two new headers. No function shall be
> added in header.c .
> Once we split all relevant headers, it will be much easier to see what
> uses the AioContext lock and remove it, which is the overall main
> goal of this and other series that I posted/will post.
> 
> In addition to splitting the relevant headers shown in this series,
> it is also very helpful splitting the function pointers in some
> block structures, to understand what runs under AioContext lock and
> what doesn't. This is what patches 19-25 do.
> 
> Each function in the GS API will have an assertion, checking
> that it is always running under BQL.
> I/O functions are instead thread safe (or so should be), meaning
> that they *can* run under BQL, but also in an iothread in another
> AioContext. Therefore they do not provide any assertion, and
> need to be audited manually to verify the correctness.
> 
> Adding assertions has helped find 2 bugs already, as shown in
> my series "Migration: fix missing iothread locking".
> 
> Tested this series by running unit tests, qemu-iotests and qtests
> (x86_64).
> Some functions in the GS API are used everywhere but not
> properly tested. Therefore their assertion is never actually run in
> the tests, so despite my very careful auditing, it is not possible
> to exclude that some will trigger while actually using QEMU.
> 
> Patch 1 introduces qemu_in_main_thread(), the function used in
> all assertions. This had to be introduced, otherwise all unit tests
> would fail, since they run in the main loop but use the code in
> stubs/iothread-lock.c.
> Patches 2-14 and 19-25 (with the exception of patch 9, that is an additional
> assert) are all structured in the same way: first we split the header
> and in the next (even) patch we add assertions.
> The rest of the patches contain either both assertions and split,
> or have no assertions.
> 
> Next steps once this gets reviewed:
> 1) audit the GS API and replace the AioContext lock with drains,
> or remove them when not necessary (requires further discussion).
> 2) [optional as it should be already the case] audit the I/O API
> and check that thread safety is guaranteed
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
> v3 -> v4:
> * blockdev.h (patch 14): blockdev_mark_auto_del, blockdev_auto_del
>   and blk_legacy_dinfo as GS API.
> * add copyright header to block.h, block-io.h and block-global-state.h
> * rebase on current master (c5b2f55981)
> 
> v2 -> v3:
> * rename "graph API" into "global state API"
> * change order of patches, block.h comes before block-backend.h
> * change GS and I/O comment headers to avoid redundancy, all other
>   headers refer to block-global-state.h and block-io.h
> * fix typo on GS and I/O headers
> * use assert instead of g_assert
> * move bdrv_pwrite_sync, bdrv_block_status and bdrv_co_copy_range_{from/to}
>   to the I/O API
> * change assert_bdrv_graph_writable implementation, since we need
>   to introduce additional drains
> * remove transactions API split
> * add preparation patch for blockdev.h (patch 13)
> * backup-top -> copy-on-write
> * change I/O comment in job.h into a better meaningful explanation
> * fix all warnings given by checkpatch, mostly due to /* */ to be
>   split in separate lines
> * rebase on current master (c09124dcb8), and split the following new functions:
> 	blk_replace_bs (I/O)
> 	bdrv_bsc_is_data (I/O)
> 	bdrv_bsc_invalidate_range (I/O)
> 	bdrv_bsc_fill (I/O)
> 	bdrv_new_open_driver_opts (GS)
> 	blk_get_max_hw_iov (I/O)
>   they are all added in patches 4 and 6.
> 
> v1 -> v2:
> * remove the iothread locking bug fix, and send it as separate patch
> * rename graph API -> global state API
> * better documented patch 1 (qemu_in_main_thread)
> * add and split all other block layer headers
> * fix warnings given by checkpatch on multiline comments
> 
> Emanuele Giuseppe Esposito (25):
>   main-loop.h: introduce qemu_in_main_thread()
>   include/block/block: split header into I/O and global state API
>   assertions for block global state API
>   include/sysemu/block-backend: split header into I/O and global state
>     (GS) API
>   block/block-backend.c: assertions for block-backend
>   include/block/block_int: split header into I/O and global state API
>   assertions for block_int global state API
>   block: introduce assert_bdrv_graph_writable
>   include/block/blockjob_int.h: split header into I/O and GS API
>   assertions for blockjob_int.h
>   include/block/blockjob.h: global state API
>   assertions for blockjob.h global state API
>   include/sysemu/blockdev.h: move drive_add and inline drive_def
>   include/sysemu/blockdev.h: global state API
>   assertions for blockdev.h global state API
>   include/block/snapshot: global state API + assertions
>   block/copy-before-write.h: global state API + assertions
>   block/coroutines: I/O API
>   block_int-common.h: split function pointers in BlockDriver
>   block_int-common.h: assertion in the callers of BlockDriver function
>     pointers
>   block_int-common.h: split function pointers in BdrvChildClass
>   block_int-common.h: assertions in the callers of BdrvChildClass
>     function pointers
>   block-backend-common.h: split function pointers in BlockDevOps
>   job.h: split function pointers in JobDriver
>   job.h: assertions in the callers of JobDriver function pointers
> 
>  block.c                                     |  188 ++-
>  block/backup.c                              |    1 +
>  block/block-backend.c                       |  105 +-
>  block/commit.c                              |    4 +
>  block/copy-before-write.c                   |    2 +
>  block/copy-before-write.h                   |    7 +
>  block/coroutines.h                          |    6 +
>  block/dirty-bitmap.c                        |    1 +
>  block/io.c                                  |   37 +
>  block/meson.build                           |    7 +-
>  block/mirror.c                              |    4 +
>  block/monitor/bitmap-qmp-cmds.c             |    6 +
>  block/monitor/block-hmp-cmds.c              |    2 +-
>  block/snapshot.c                            |   28 +
>  block/stream.c                              |    2 +
>  blockdev.c                                  |   55 +-
>  blockjob.c                                  |   14 +
>  include/block/block-common.h                |  389 +++++
>  include/block/block-global-state.h          |  286 ++++
>  include/block/block-io.h                    |  306 ++++
>  include/block/block.h                       |  878 +----------
>  include/block/block_int-common.h            | 1193 +++++++++++++++
>  include/block/block_int-global-state.h      |  327 ++++
>  include/block/block_int-io.h                |  163 ++
>  include/block/block_int.h                   | 1478 +------------------
>  include/block/blockjob.h                    |    9 +
>  include/block/blockjob_int.h                |   28 +
>  include/block/snapshot.h                    |   13 +-
>  include/qemu/job.h                          |   16 +
>  include/qemu/main-loop.h                    |   13 +
>  include/sysemu/block-backend-common.h       |   92 ++
>  include/sysemu/block-backend-global-state.h |  122 ++
>  include/sysemu/block-backend-io.h           |  139 ++
>  include/sysemu/block-backend.h              |  269 +---
>  include/sysemu/blockdev.h                   |   24 +-
>  job.c                                       |    9 +
>  migration/savevm.c                          |    2 +
>  softmmu/cpus.c                              |    5 +
>  softmmu/qdev-monitor.c                      |    2 +
>  softmmu/vl.c                                |   25 +-
>  stubs/iothread-lock.c                       |    5 +
>  41 files changed, 3619 insertions(+), 2643 deletions(-)
>  create mode 100644 include/block/block-common.h
>  create mode 100644 include/block/block-global-state.h
>  create mode 100644 include/block/block-io.h
>  create mode 100644 include/block/block_int-common.h
>  create mode 100644 include/block/block_int-global-state.h
>  create mode 100644 include/block/block_int-io.h
>  create mode 100644 include/sysemu/block-backend-common.h
>  create mode 100644 include/sysemu/block-backend-global-state.h
>  create mode 100644 include/sysemu/block-backend-io.h
> 
> -- 
> 2.27.0
> 



* Re: [PATCH v4 02/25] include/block/block: split header into I/O and global state API
  2021-10-25 10:17 ` [PATCH v4 02/25] include/block/block: split header into I/O and global state API Emanuele Giuseppe Esposito
  2021-10-25 11:37   ` Philippe Mathieu-Daudé
@ 2021-11-11 15:00   ` Hanna Reitz
  2021-11-15 12:08     ` Emanuele Giuseppe Esposito
  2021-11-12 12:25   ` Hanna Reitz
  2 siblings, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-11 15:00 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> block.h currently contains a mix of functions:
> some of them run under the BQL and modify the block layer graph,
> others are instead thread-safe and perform I/O in iothreads.
> It is not easy to understand which function is part of which
> group (I/O vs GS), and this patch aims to clarify it.
>
> The "GS" functions need the BQL, and often use
> aio_context_acquire/release and/or drain to be sure they
> can modify the graph safely.
> The I/O functions are instead thread-safe, and can run in
> any AioContext.
>
> By splitting the header in two files, block-io.h
> and block-global-state.h we have a clearer view on what
> needs what kind of protection. block-common.h
> contains common structures shared by both headers.
>
> block.h is left there for legacy and to avoid changing
> all includes in all c files that use the block APIs.
>
> Assertions are added in the next patch.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block.c                            |   3 +
>   block/meson.build                  |   7 +-
>   include/block/block-common.h       | 389 +++++++++++++
>   include/block/block-global-state.h | 286 ++++++++++
>   include/block/block-io.h           | 306 ++++++++++
>   include/block/block.h              | 878 +----------------------------
>   6 files changed, 1012 insertions(+), 857 deletions(-)
>   create mode 100644 include/block/block-common.h
>   create mode 100644 include/block/block-global-state.h
>   create mode 100644 include/block/block-io.h

[...]

> diff --git a/include/block/block-io.h b/include/block/block-io.h
> new file mode 100644
> index 0000000000..9af4609ccb
> --- /dev/null
> +++ b/include/block/block-io.h

[...]

> +/*
> + * I/O API functions. These functions are thread-safe, and therefore
> + * can run in any thread as long as the thread has called
> + * aio_context_acquire/release().
> + */
> +
> +int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
> +                          Error **errp);

Why is this function here?  Naïvely, I would’ve assumed as a 
graph-modifying function it should be in block-global-state.h.

I mean, perhaps it’s thread-safe and then it can fit here, too. Still, 
it surprises me a bit to find this here.

Hanna




* Re: [PATCH v4 03/25] assertions for block global state API
  2021-10-25 10:17 ` [PATCH v4 03/25] assertions for block " Emanuele Giuseppe Esposito
@ 2021-11-11 16:32   ` Hanna Reitz
  2021-11-15 12:27     ` Emanuele Giuseppe Esposito
  2021-11-12 11:31   ` Hanna Reitz
  1 sibling, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-11 16:32 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> All the global state (GS) API functions will check that
> qemu_in_main_thread() returns true. If not, it means
> that the safety of BQL cannot be guaranteed, and
> they need to be moved to I/O.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block.c        | 136 +++++++++++++++++++++++++++++++++++++++++++++++--
>   block/commit.c |   2 +
>   block/io.c     |  20 ++++++++
>   blockdev.c     |   1 +
>   4 files changed, 156 insertions(+), 3 deletions(-)
>
> diff --git a/block.c b/block.c
> index 6fdb4d7712..672f946065 100644
> --- a/block.c
> +++ b/block.c

[...]

> @@ -5606,7 +5678,6 @@ int64_t bdrv_getlength(BlockDriverState *bs)
>   void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr)
>   {
>       int64_t nb_sectors = bdrv_nb_sectors(bs);
> -
>       *nb_sectors_ptr = nb_sectors < 0 ? 0 : nb_sectors;
>   }
>   

This hunk, at least, seems unrelated.

[...]

> @@ -5958,6 +6043,7 @@ const char *bdrv_get_parent_name(const BlockDriverState *bs)
>   /* TODO check what callers really want: bs->node_name or blk_name() */
>   const char *bdrv_get_device_name(const BlockDriverState *bs)
>   {
> +    assert(qemu_in_main_thread());
>       return bdrv_get_parent_name(bs) ?: "";
>   }
>   

This function is invoked from qcow2_signal_corruption(), which comes 
generally from an I/O path.  Is it safe to assert that we’re in the main 
thread here?

Well, the question is probably rather whether this really needs to be 
considered a global-state function, or whether putting it in common or 
I/O is fine.  I believe you’re right that, given it invokes 
bdrv_get_parent_name(), it cannot be thread-safe, but then we’ll have to 
change qcow2_signal_corruption() so it doesn’t invoke this function.
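
To make the concern concrete, here is a minimal standalone sketch of the rule the new assertion enforces. It is not QEMU code: ApiClass, call_allowed() and check_call() are hypothetical names made up for illustration, and the in_main_thread argument merely stands in for the result of qemu_in_main_thread().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not QEMU code: the two API classes from this series. */
typedef enum ApiClass {
    API_GS, /* global state: expected to run in the main thread, under the BQL */
    API_IO, /* I/O: thread-safe, may run in an iothread */
} ApiClass;

/*
 * Returns true when calling a function of class @callee is acceptable from
 * a thread for which qemu_in_main_thread() would return @in_main_thread.
 */
static bool call_allowed(ApiClass callee, bool in_main_thread)
{
    return callee == API_IO || in_main_thread;
}

/*
 * Stand-in for the assertion added at the top of each GS function: it
 * aborts exactly when a GS function is reached outside the main thread.
 */
static void check_call(ApiClass callee, bool in_main_thread)
{
    assert(call_allowed(callee, in_main_thread));
}
```

In this model, an I/O path such as the one through qcow2_signal_corruption() corresponds to call_allowed(API_GS, false), which is false, so the assertion would fire if that path ever runs in an iothread.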

[...]

> diff --git a/block/io.c b/block/io.c
> index bb0a254def..c5d7f8495e 100644
> --- a/block/io.c
> +++ b/block/io.c

[...]

> @@ -544,6 +546,7 @@ void bdrv_drained_end(BlockDriverState *bs)
>   
>   void bdrv_drained_end_no_poll(BlockDriverState *bs, int *drained_end_counter)
>   {
> +    assert(qemu_in_main_thread());
>       bdrv_do_drained_end(bs, false, NULL, false, drained_end_counter);
>   }

Why is bdrv_drained_end() an I/O function while this one is a GS 
function, even though it does just a subset of the work?

> @@ -586,12 +589,14 @@ void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent)
>   void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
>   {
>       assert(qemu_in_coroutine());
> +    assert(qemu_in_main_thread());
>       bdrv_drained_begin(bs);
>       bdrv_drained_end(bs);
>   }
>   
>   void bdrv_drain(BlockDriverState *bs)
>   {
> +    assert(qemu_in_main_thread());
>       bdrv_drained_begin(bs);
>       bdrv_drained_end(bs);
>   }

Why are these GS functions when both bdrv_drained_begin() and 
bdrv_drained_end() are I/O functions?

I can understand making the drain_all functions GS functions, but it 
seems weird to say it’s an I/O function when a single BDS is drained via 
bdrv_drained_begin() and bdrv_drained_end(), but not via bdrv_drain(), 
which just does both.

(I can see that there are no I/O path callers, but I still find it strange.)
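
The consistency argument can be sketched as a small standalone check. This is a hypothetical model, not anything from the series: ApiClass, implied_class() and stricter_than_needed() are names invented here, and the callee lists are supplied by hand rather than derived from the real call graph.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not QEMU code. */
typedef enum ApiClass { API_GS, API_IO } ApiClass;

/*
 * Class that a wrapper's callees force on it: if any callee is GS, the
 * wrapper must be GS as well; if all callees are I/O, the wrapper could
 * be classified as I/O.
 */
static ApiClass implied_class(const ApiClass *callees, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (callees[i] == API_GS) {
            return API_GS;
        }
    }
    return API_IO;
}

/* True when @declared is GS although the callees alone would permit I/O. */
static bool stricter_than_needed(ApiClass declared,
                                 const ApiClass *callees, size_t n)
{
    return declared == API_GS && implied_class(callees, n) == API_IO;
}
```

With bdrv_drained_begin() and bdrv_drained_end() both entered as API_IO, a GS-declared bdrv_drain() is flagged as stricter than needed. That is not wrong in itself, since a function may be kept GS for other reasons (such as having no I/O path callers), but it is the kind of case that deserves an explanation.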

[...]

> @@ -2731,6 +2742,7 @@ int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
>   int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
>                         int64_t *pnum, int64_t *map, BlockDriverState **file)
>   {
> +    assert(qemu_in_main_thread());
>       return bdrv_block_status_above(bs, bdrv_filter_or_cow_bs(bs),
>                                      offset, bytes, pnum, map, file);
>   }

Why is this a GS function as opposed to all other block-status 
functions?  Because of the bdrv_filter_or_cow_bs() call?

And isn’t the call from nvme_block_status_all() basically an I/O path?  
(Or is that always run in the main thread?)

> @@ -2800,6 +2812,7 @@ int bdrv_is_allocated_above(BlockDriverState *top,
>                               int64_t bytes, int64_t *pnum)
>   {
>       int depth;
> +
>       int ret = bdrv_common_block_status_above(top, base, include_base, false,
>                                                offset, bytes, pnum, NULL, NULL,
>                                                &depth);

This hunk too seems unrelated.

Hanna




* Re: [PATCH v4 04/25] include/sysemu/block-backend: split header into I/O and global state (GS) API
  2021-10-25 10:17 ` [PATCH v4 04/25] include/sysemu/block-backend: split header into I/O and global state (GS) API Emanuele Giuseppe Esposito
@ 2021-11-12 10:23   ` Hanna Reitz
  2021-11-16 10:16     ` Emanuele Giuseppe Esposito
  2021-11-12 12:30   ` Hanna Reitz
  1 sibling, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-12 10:23 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> Similarly to the previous patches, split block-backend.h
> in block-backend-io.h and block-backend-global-state.h
>
> In addition, remove "block/block.h" include as it seems
> it is not necessary anymore, together with "qemu/iov.h"
>
> block-backend-common.h contains the structures shared between
> the two headers, and the functions that can't be categorized as
> I/O or global state.
>
> Assertions are added in the next patch.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block/block-backend.c                       |   9 +-
>   include/sysemu/block-backend-common.h       |  74 ++++++
>   include/sysemu/block-backend-global-state.h | 122 +++++++++
>   include/sysemu/block-backend-io.h           | 139 ++++++++++
>   include/sysemu/block-backend.h              | 269 +-------------------
>   5 files changed, 344 insertions(+), 269 deletions(-)
>   create mode 100644 include/sysemu/block-backend-common.h
>   create mode 100644 include/sysemu/block-backend-global-state.h
>   create mode 100644 include/sysemu/block-backend-io.h
>
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 39cd99df2b..0afc03fd66 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -79,6 +79,7 @@ struct BlockBackend {
>       bool allow_aio_context_change;
>       bool allow_write_beyond_eof;
>   
> +    /* Protected by BQL lock */
>       NotifierList remove_bs_notifiers, insert_bs_notifiers;
>       QLIST_HEAD(, BlockBackendAioNotifier) aio_notifiers;
>   
> @@ -111,12 +112,14 @@ static const AIOCBInfo block_backend_aiocb_info = {
>   static void drive_info_del(DriveInfo *dinfo);
>   static BlockBackend *bdrv_first_blk(BlockDriverState *bs);
>   
> -/* All BlockBackends */
> +/* All BlockBackends. Protected by BQL lock. */
>   static QTAILQ_HEAD(, BlockBackend) block_backends =
>       QTAILQ_HEAD_INITIALIZER(block_backends);
>   
> -/* All BlockBackends referenced by the monitor and which are iterated through by
> - * blk_next() */
> +/*
> + * All BlockBackends referenced by the monitor and which are iterated through by
> + * blk_next(). Protected by BQL lock.
> + */
>   static QTAILQ_HEAD(, BlockBackend) monitor_block_backends =
>       QTAILQ_HEAD_INITIALIZER(monitor_block_backends);
>   
> diff --git a/include/sysemu/block-backend-common.h b/include/sysemu/block-backend-common.h
> new file mode 100644
> index 0000000000..52ff6a4d26
> --- /dev/null
> +++ b/include/sysemu/block-backend-common.h
> @@ -0,0 +1,74 @@
> +/*
> + * QEMU Block backends
> + *
> + * Copyright (C) 2014-2016 Red Hat, Inc.
> + *
> + * Authors:
> + *  Markus Armbruster <armbru@redhat.com>,
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1
> + * or later.  See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#ifndef BLOCK_BACKEND_COMMON_H
> +#define BLOCK_BACKEND_COMMON_H
> +
> +#include "block/throttle-groups.h"
> +
> +/* Callbacks for block device models */
> +typedef struct BlockDevOps {
> +    /*
> +     * Runs when virtual media changed (monitor commands eject, change)
> +     * Argument load is true on load and false on eject.
> +     * Beware: doesn't run when a host device's physical media
> +     * changes.  Sure would be useful if it did.
> +     * Device models with removable media must implement this callback.
> +     */
> +    void (*change_media_cb)(void *opaque, bool load, Error **errp);
> +    /*
> +     * Runs when an eject request is issued from the monitor, the tray
> +     * is closed, and the medium is locked.
> +     * Device models that do not implement is_medium_locked will not need
> +     * this callback.  Device models that can lock the medium or tray might
> +     * want to implement the callback and unlock the tray when "force" is
> +     * true, even if they do not support eject requests.
> +     */
> +    void (*eject_request_cb)(void *opaque, bool force);
> +    /*
> +     * Is the virtual tray open?
> +     * Device models implement this only when the device has a tray.
> +     */
> +    bool (*is_tray_open)(void *opaque);
> +    /*
> +     * Is the virtual medium locked into the device?
> +     * Device models implement this only when device has such a lock.
> +     */
> +    bool (*is_medium_locked)(void *opaque);
> +    /*
> +     * Runs when the size changed (e.g. monitor command block_resize)
> +     */
> +    void (*resize_cb)(void *opaque);
> +    /*
> +     * Runs when the backend receives a drain request.
> +     */
> +    void (*drained_begin)(void *opaque);
> +    /*
> +     * Runs when the backend's last drain request ends.
> +     */
> +    void (*drained_end)(void *opaque);
> +    /*
> +     * Is the device still busy?
> +     */
> +    bool (*drained_poll)(void *opaque);
> +} BlockDevOps;
> +
> +/*
> + * This struct is embedded in (the private) BlockBackend struct and contains
> + * fields that must be public. This is in particular for QLIST_ENTRY() and
> + * friends so that BlockBackends can be kept in lists outside block-backend.c
> + */
> +typedef struct BlockBackendPublic {
> +    ThrottleGroupMember throttle_group_member;
> +} BlockBackendPublic;
> +
> +#endif /* BLOCK_BACKEND_COMMON_H */
> diff --git a/include/sysemu/block-backend-global-state.h b/include/sysemu/block-backend-global-state.h
> new file mode 100644
> index 0000000000..4001b1c02a
> --- /dev/null
> +++ b/include/sysemu/block-backend-global-state.h
> @@ -0,0 +1,122 @@
> +/*
> + * QEMU Block backends
> + *
> + * Copyright (C) 2014-2016 Red Hat, Inc.
> + *
> + * Authors:
> + *  Markus Armbruster <armbru@redhat.com>,
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1
> + * or later.  See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#ifndef BLOCK_BACKEND_GS_H
> +#define BLOCK_BACKEND_GS_H
> +
> +#include "block-backend-common.h"
> +
> +/*
> + * Global state (GS) API. These functions run under the BQL lock.
> + *
> + * See include/block/block-global-state.h for more information about
> + * the GS API.
> + */
> +
> +BlockBackend *blk_new(AioContext *ctx, uint64_t perm, uint64_t shared_perm);
> +BlockBackend *blk_new_with_bs(BlockDriverState *bs, uint64_t perm,
> +                              uint64_t shared_perm, Error **errp);
> +BlockBackend *blk_new_open(const char *filename, const char *reference,
> +                           QDict *options, int flags, Error **errp);
> +int blk_get_refcnt(BlockBackend *blk);
> +void blk_ref(BlockBackend *blk);
> +void blk_unref(BlockBackend *blk);
> +void blk_remove_all_bs(void);
> +const char *blk_name(const BlockBackend *blk);

This is called by send_qmp_error_event(), which in turn is called by 
blk_error_action().  Are those strictly main loop functions?

> +BlockBackend *blk_by_name(const char *name);
> +BlockBackend *blk_next(BlockBackend *blk);
> +BlockBackend *blk_all_next(BlockBackend *blk);
> +bool monitor_add_blk(BlockBackend *blk, const char *name, Error **errp);
> +void monitor_remove_blk(BlockBackend *blk);
> +
> +BlockBackendPublic *blk_get_public(BlockBackend *blk);
> +BlockBackend *blk_by_public(BlockBackendPublic *public);
> +
> +void blk_remove_bs(BlockBackend *blk);
> +int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp);
> +bool bdrv_has_blk(BlockDriverState *bs);
> +bool bdrv_is_root_node(BlockDriverState *bs);
> +int blk_set_perm(BlockBackend *blk, uint64_t perm, uint64_t shared_perm,
> +                 Error **errp);
> +void blk_get_perm(BlockBackend *blk, uint64_t *perm, uint64_t *shared_perm);

These functions are called from fuse_do_truncate(), which I believe runs 
in the context of the export’s BlockBackend.  I’m not saying that’s 
necessarily correct, but as of the next patch, this happens:

$ touch /tmp/fuse-export
$ storage-daemon/qemu-storage-daemon \
   --object iothread,id=iothr0 \
   --blockdev file,node-name=node0,filename=/tmp/fuse-export \
   --export fuse,id=exp0,node-name=node0,mountpoint=/tmp/fuse-export,iothread=iothr0,writable=true \
   &
[1] 27395
$ truncate /tmp/fuse-export -s 1M
qemu-storage-daemon: ../block/block-backend.c:935: blk_get_perm: Assertion `qemu_in_main_thread()' failed.
truncate: failed to truncate '/tmp/fuse-export' at 1048576 bytes: Software caused connection abort
truncate: failed to close '/tmp/fuse-export': Transport endpoint is not connected
[1]  + 27395 IOT instruction (core dumped) storage-daemon/qemu-storage-daemon --object iothread,id=iothr0 --blockdev

> +
> +void blk_iostatus_enable(BlockBackend *blk);
> +bool blk_iostatus_is_enabled(const BlockBackend *blk);
> +BlockDeviceIoStatus blk_iostatus(const BlockBackend *blk);
> +void blk_iostatus_disable(BlockBackend *blk);
> +void blk_iostatus_reset(BlockBackend *blk);
> +void blk_iostatus_set_err(BlockBackend *blk, int error);
> +int blk_attach_dev(BlockBackend *blk, DeviceState *dev);
> +void blk_detach_dev(BlockBackend *blk, DeviceState *dev);
> +DeviceState *blk_get_attached_dev(BlockBackend *blk);
> +char *blk_get_attached_dev_id(BlockBackend *blk);
> +BlockBackend *blk_by_dev(void *dev);
> +BlockBackend *blk_by_qdev_id(const char *id, Error **errp);
> +void blk_set_dev_ops(BlockBackend *blk, const BlockDevOps *ops, void *opaque);
> +
> +int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags);
> +int64_t blk_nb_sectors(BlockBackend *blk);

I’d have considered this an I/O function, and blk_getlength() is 
classified as such.  Why not this?

> +int blk_commit_all(void);
> +void blk_drain(BlockBackend *blk);

I’m again wondering a bit why this is a GS function when 
bdrv_drained_begin() is an I/O function, though less so than in the case 
of bdrv_drain(), given that there is no blk_drained_begin() that’s 
classified as I/O.

> +void blk_drain_all(void);
> +void blk_set_on_error(BlockBackend *blk, BlockdevOnError on_read_error,
> +                      BlockdevOnError on_write_error);
> +bool blk_supports_write_perm(BlockBackend *blk);
> +bool blk_is_sg(BlockBackend *blk);
> +bool blk_enable_write_cache(BlockBackend *blk);
> +void blk_set_enable_write_cache(BlockBackend *blk, bool wce);
> +void blk_lock_medium(BlockBackend *blk, bool locked);
> +void blk_eject(BlockBackend *blk, bool eject_flag);
> +int blk_get_flags(BlockBackend *blk);
> +void blk_set_guest_block_size(BlockBackend *blk, int align);
> +bool blk_op_is_blocked(BlockBackend *blk, BlockOpType op, Error **errp);
> +void blk_op_unblock(BlockBackend *blk, BlockOpType op, Error *reason);
> +void blk_op_block_all(BlockBackend *blk, Error *reason);
> +void blk_op_unblock_all(BlockBackend *blk, Error *reason);
> +int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
> +                        Error **errp);
> +void blk_add_aio_context_notifier(BlockBackend *blk,
> +        void (*attached_aio_context)(AioContext *new_context, void *opaque),
> +        void (*detach_aio_context)(void *opaque), void *opaque);
> +void blk_remove_aio_context_notifier(BlockBackend *blk,
> +                                     void (*attached_aio_context)(AioContext *,
> +                                                                  void *),
> +                                     void (*detach_aio_context)(void *),
> +                                     void *opaque);
> +void blk_add_remove_bs_notifier(BlockBackend *blk, Notifier *notify);
> +void blk_add_insert_bs_notifier(BlockBackend *blk, Notifier *notify);
> +BlockBackendRootState *blk_get_root_state(BlockBackend *blk);
> +void blk_update_root_state(BlockBackend *blk);
> +bool blk_get_detect_zeroes_from_root_state(BlockBackend *blk);
> +int blk_get_open_flags_from_root_state(BlockBackend *blk);
> +
> +int blk_save_vmstate(BlockBackend *blk, const uint8_t *buf,
> +                     int64_t pos, int size);
> +int blk_load_vmstate(BlockBackend *blk, uint8_t *buf, int64_t pos, int size);
> +int blk_probe_blocksizes(BlockBackend *blk, BlockSizes *bsz);
> +int blk_probe_geometry(BlockBackend *blk, HDGeometry *geo);
> +BlockAIOCB *blk_abort_aio_request(BlockBackend *blk,
> +                                  BlockCompletionFunc *cb,
> +                                  void *opaque, int ret);

This sounds more like an I/O function to me.

> +
> +void blk_set_io_limits(BlockBackend *blk, ThrottleConfig *cfg);
> +void blk_io_limits_disable(BlockBackend *blk);
> +void blk_io_limits_enable(BlockBackend *blk, const char *group);
> +void blk_io_limits_update_group(BlockBackend *blk, const char *group);
> +void blk_set_force_allow_inactivate(BlockBackend *blk);
> +
> +void blk_register_buf(BlockBackend *blk, void *host, size_t size);
> +void blk_unregister_buf(BlockBackend *blk, void *host);
> +
> +const BdrvChild *blk_root(BlockBackend *blk);
> +
> +#endif /* BLOCK_BACKEND_GS_H */
> diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
> new file mode 100644
> index 0000000000..ab0463cb69
> --- /dev/null
> +++ b/include/sysemu/block-backend-io.h
> @@ -0,0 +1,139 @@
> +/*
> + * QEMU Block backends
> + *
> + * Copyright (C) 2014-2016 Red Hat, Inc.
> + *
> + * Authors:
> + *  Markus Armbruster <armbru@redhat.com>,
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1
> + * or later.  See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#ifndef BLOCK_BACKEND_IO_H
> +#define BLOCK_BACKEND_IO_H
> +
> +#include "block-backend-common.h"
> +
> +/*
> + * I/O API functions. These functions are thread-safe.
> + *
> + * See include/block/block-io.h for more information about
> + * the I/O API.
> + */
> +
> +BlockDriverState *blk_bs(BlockBackend *blk);
> +
> +int blk_replace_bs(BlockBackend *blk, BlockDriverState *new_bs, Error **errp);

This sounds like a GS function to me.

> +
> +void blk_set_allow_write_beyond_eof(BlockBackend *blk, bool allow);
> +void blk_set_allow_aio_context_change(BlockBackend *blk, bool allow);
> +void blk_set_disable_request_queuing(BlockBackend *blk, bool disable);
> +
> +int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int bytes);
> +int blk_pwrite(BlockBackend *blk, int64_t offset, const void *buf, int bytes,
> +               BdrvRequestFlags flags);
> +int coroutine_fn blk_co_preadv(BlockBackend *blk, int64_t offset,
> +                               int64_t bytes, QEMUIOVector *qiov,
> +                               BdrvRequestFlags flags);
> +int coroutine_fn blk_co_pwritev_part(BlockBackend *blk, int64_t offset,
> +                                     int64_t bytes,
> +                                     QEMUIOVector *qiov, size_t qiov_offset,
> +                                     BdrvRequestFlags flags);
> +int coroutine_fn blk_co_pwritev(BlockBackend *blk, int64_t offset,
> +                                int64_t bytes, QEMUIOVector *qiov,
> +                                BdrvRequestFlags flags);
> +
> +static inline int coroutine_fn blk_co_pread(BlockBackend *blk, int64_t offset,
> +                                            int64_t bytes, void *buf,
> +                                            BdrvRequestFlags flags)
> +{
> +    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
> +
> +    assert(bytes <= SIZE_MAX);
> +
> +    return blk_co_preadv(blk, offset, bytes, &qiov, flags);
> +}
> +
> +static inline int coroutine_fn blk_co_pwrite(BlockBackend *blk, int64_t offset,
> +                                             int64_t bytes, void *buf,
> +                                             BdrvRequestFlags flags)
> +{
> +    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
> +
> +    assert(bytes <= SIZE_MAX);
> +
> +    return blk_co_pwritev(blk, offset, bytes, &qiov, flags);
> +}
> +
> +BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
> +                                  int64_t bytes, BdrvRequestFlags flags,
> +                                  BlockCompletionFunc *cb, void *opaque);
> +
> +BlockAIOCB *blk_aio_preadv(BlockBackend *blk, int64_t offset,
> +                           QEMUIOVector *qiov, BdrvRequestFlags flags,
> +                           BlockCompletionFunc *cb, void *opaque);
> +BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
> +                            QEMUIOVector *qiov, BdrvRequestFlags flags,
> +                            BlockCompletionFunc *cb, void *opaque);
> +BlockAIOCB *blk_aio_flush(BlockBackend *blk,
> +                          BlockCompletionFunc *cb, void *opaque);
> +BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
> +                             BlockCompletionFunc *cb, void *opaque);
> +void blk_aio_cancel(BlockAIOCB *acb);
> +void blk_aio_cancel_async(BlockAIOCB *acb);
> +int blk_ioctl(BlockBackend *blk, unsigned long int req, void *buf);
> +BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
> +                          BlockCompletionFunc *cb, void *opaque);
> +int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
> +                                 int64_t bytes);
> +int coroutine_fn blk_co_flush(BlockBackend *blk);
> +int blk_flush(BlockBackend *blk);
> +void blk_inc_in_flight(BlockBackend *blk);
> +void blk_dec_in_flight(BlockBackend *blk);
> +bool blk_is_inserted(BlockBackend *blk);
> +bool blk_is_available(BlockBackend *blk);
> +int64_t blk_getlength(BlockBackend *blk);
> +void blk_get_geometry(BlockBackend *blk, uint64_t *nb_sectors_ptr);
> +void *blk_try_blockalign(BlockBackend *blk, size_t size);
> +void *blk_blockalign(BlockBackend *blk, size_t size);
> +bool blk_is_writable(BlockBackend *blk);
> +BlockdevOnError blk_get_on_error(BlockBackend *blk, bool is_read);
> +BlockErrorAction blk_get_error_action(BlockBackend *blk, bool is_read,
> +                                      int error);
> +void blk_error_action(BlockBackend *blk, BlockErrorAction action,
> +                      bool is_read, int error);
> +int blk_get_max_iov(BlockBackend *blk);
> +int blk_get_max_hw_iov(BlockBackend *blk);
> +
> +void blk_invalidate_cache(BlockBackend *blk, Error **errp);
> +
> +void blk_io_plug(BlockBackend *blk);
> +void blk_io_unplug(BlockBackend *blk);
> +AioContext *blk_get_aio_context(BlockBackend *blk);
> +BlockAcctStats *blk_get_stats(BlockBackend *blk);
> +void *blk_aio_get(const AIOCBInfo *aiocb_info, BlockBackend *blk,
> +                  BlockCompletionFunc *cb, void *opaque);
> +int blk_pwrite_compressed(BlockBackend *blk, int64_t offset, const void *buf,
> +                          int64_t bytes);
> +int blk_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
> +int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
> +                      int64_t bytes, BdrvRequestFlags flags);
> +int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
> +                                      int64_t bytes, BdrvRequestFlags flags);
> +int blk_truncate(BlockBackend *blk, int64_t offset, bool exact,
> +                 PreallocMode prealloc, BdrvRequestFlags flags, Error **errp);
> +
> +uint32_t blk_get_request_alignment(BlockBackend *blk);
> +uint32_t blk_get_max_transfer(BlockBackend *blk);
> +uint64_t blk_get_max_hw_transfer(BlockBackend *blk);
> +
> +int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,
> +                                   BlockBackend *blk_out, int64_t off_out,
> +                                   int64_t bytes, BdrvRequestFlags read_flags,
> +                                   BdrvRequestFlags write_flags);
> +
> +
> +int blk_make_empty(BlockBackend *blk, Error **errp);

bdrv_make_empty() (called by this function) is classified as a GS 
function, and asserts that it’s running in the main thread.  I can also 
see that the next patch adds the same assertion to blk_make_empty(), so 
I believe this should be a GS function, too.

Hanna




* Re: [PATCH v4 05/25] block/block-backend.c: assertions for block-backend
  2021-10-25 10:17 ` [PATCH v4 05/25] block/block-backend.c: assertions for block-backend Emanuele Giuseppe Esposito
@ 2021-11-12 11:01   ` Hanna Reitz
  2021-11-16 10:15     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-12 11:01 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> All the global state (GS) API functions will check that
> qemu_in_main_thread() returns true. If not, it means
> that the safety of BQL cannot be guaranteed, and
> they need to be moved to I/O.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block/block-backend.c  | 90 +++++++++++++++++++++++++++++++++++++++++-
>   softmmu/qdev-monitor.c |  2 +
>   2 files changed, 91 insertions(+), 1 deletion(-)
>
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 0afc03fd66..ed45576007 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c

[...]

> @@ -1550,6 +1596,7 @@ BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
>   
>   void blk_aio_cancel(BlockAIOCB *acb)
>   {
> +    assert(qemu_in_main_thread());
>       bdrv_aio_cancel(acb);
>   }
>   

This function is in block-backend-io.h, though.

[...]

> @@ -1879,7 +1936,6 @@ void blk_invalidate_cache(BlockBackend *blk, Error **errp)
>   bool blk_is_inserted(BlockBackend *blk)
>   {
>       BlockDriverState *bs = blk_bs(blk);
> -
>       return bs && bdrv_is_inserted(bs);
>   }

Seems like an unrelated hunk.

[...]

> @@ -2443,11 +2529,13 @@ int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,

[...]

>   int blk_make_empty(BlockBackend *blk, Error **errp)
>   {
> +    assert(qemu_in_main_thread());
>       if (!blk_is_available(blk)) {
>           error_setg(errp, "No medium inserted");
>           return -ENOMEDIUM;

This function too is in block-backend-io.h.

Hanna




* Re: [PATCH v4 03/25] assertions for block global state API
  2021-10-25 10:17 ` [PATCH v4 03/25] assertions for block " Emanuele Giuseppe Esposito
  2021-11-11 16:32   ` Hanna Reitz
@ 2021-11-12 11:31   ` Hanna Reitz
  1 sibling, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-12 11:31 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> All the global state (GS) API functions will check that
> qemu_in_main_thread() returns true. If not, it means
> that the safety of BQL cannot be guaranteed, and
> they need to be moved to I/O.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block.c        | 136 +++++++++++++++++++++++++++++++++++++++++++++++--
>   block/commit.c |   2 +
>   block/io.c     |  20 ++++++++
>   blockdev.c     |   1 +
>   4 files changed, 156 insertions(+), 3 deletions(-)

bdrv_make_zero() seems missing here – it can be considered an I/O or a 
GS function, but patch 2 classified it as GS.

Hanna




* Re: [PATCH v4 06/25] include/block/block_int: split header into I/O and global state API
  2021-10-25 10:17 ` [PATCH v4 06/25] include/block/block_int: split header into I/O and global state API Emanuele Giuseppe Esposito
@ 2021-11-12 12:17   ` Hanna Reitz
  2021-11-16 10:24     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-12 12:17 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> Similarly to the previous patch, split block_int.h
> in block_int-io.h and block_int-global-state.h
>
> block_int-common.h contains the structures shared between
> the two headers, and the functions that can't be categorized as
> I/O or global state.
>
> Assertions are added in the next patch.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   blockdev.c                             |    5 +
>   include/block/block_int-common.h       | 1164 +++++++++++++++++++
>   include/block/block_int-global-state.h |  319 +++++
>   include/block/block_int-io.h           |  163 +++
>   include/block/block_int.h              | 1478 +-----------------------
>   5 files changed, 1654 insertions(+), 1475 deletions(-)
>   create mode 100644 include/block/block_int-common.h
>   create mode 100644 include/block/block_int-global-state.h
>   create mode 100644 include/block/block_int-io.h

[...]

> diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> new file mode 100644
> index 0000000000..79a3d801d2
> --- /dev/null
> +++ b/include/block/block_int-common.h

[...]

> +struct BlockDriver {

[...]

> +    /**
> +     * Try to get @bs's logical and physical block size.
> +     * On success, store them in @bsz and return zero.
> +     * On failure, return negative errno.
> +     */
> +    /* I/O API, even though if it's a filter jumps on parent */

I don’t understand this...

> +    int (*bdrv_probe_blocksizes)(BlockDriverState *bs, BlockSizes *bsz);
> +    /**
> +     * Try to get @bs's geometry (cyls, heads, sectors)
> +     * On success, store them in @geo and return 0.
> +     * On failure return -errno.
> +     * Only drivers that want to override guest geometry implement this
> +     * callback; see hd_geometry_guess().
> +     */
> +    /* I/O API, even though if it's a filter jumps on parent */

...or this comment.  bdrv_probe_blocksizes() and bdrv_probe_geometry() 
are in block-global-state.h, so why are these methods part of the I/O 
API?  (And I’m afraid I can’t parse “even though if it’s a filter jumps 
on parent”.)

Hanna

> +    int (*bdrv_probe_geometry)(BlockDriverState *bs, HDGeometry *geo);




* Re: [PATCH v4 02/25] include/block/block: split header into I/O and global state API
  2021-10-25 10:17 ` [PATCH v4 02/25] include/block/block: split header into I/O and global state API Emanuele Giuseppe Esposito
  2021-10-25 11:37   ` Philippe Mathieu-Daudé
  2021-11-11 15:00   ` Hanna Reitz
@ 2021-11-12 12:25   ` Hanna Reitz
  2021-11-16 14:00     ` Emanuele Giuseppe Esposito
  2 siblings, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-12 12:25 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> block.h currently contains a mix of functions:
> some of them run under the BQL and modify the block layer graph,
> others are instead thread-safe and perform I/O in iothreads.
> It is not easy to understand which function is part of which
> group (I/O vs GS), and this patch aims to clarify it.
>
> The "GS" functions need the BQL, and often use
> aio_context_acquire/release and/or drain to be sure they
> can modify the graph safely.
> The I/O function are instead thread safe, and can run in
> any AioContext.
>
> By splitting the header in two files, block-io.h
> and block-global-state.h we have a clearer view on what
> needs what kind of protection. block-common.h
> contains common structures shared by both headers.
>
> block.h is left there for legacy and to avoid changing
> all includes in all c files that use the block APIs.
>
> Assertions are added in the next patch.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block.c                            |   3 +
>   block/meson.build                  |   7 +-
>   include/block/block-common.h       | 389 +++++++++++++
>   include/block/block-global-state.h | 286 ++++++++++
>   include/block/block-io.h           | 306 ++++++++++
>   include/block/block.h              | 878 +----------------------------
>   6 files changed, 1012 insertions(+), 857 deletions(-)
>   create mode 100644 include/block/block-common.h
>   create mode 100644 include/block/block-global-state.h
>   create mode 100644 include/block/block-io.h

[...]

> diff --git a/include/block/block-common.h b/include/block/block-common.h
> new file mode 100644
> index 0000000000..4f1fd8de21
> --- /dev/null
> +++ b/include/block/block-common.h

[...]

> +#define BLKDBG_EVENT(child, evt) \
> +    do { \
> +        if (child) { \
> +            bdrv_debug_event(child->bs, evt); \
> +        } \
> +    } while (0)

This is defined twice, once here, and...

> diff --git a/include/block/block-io.h b/include/block/block-io.h
> new file mode 100644
> index 0000000000..9af4609ccb
> --- /dev/null
> +++ b/include/block/block-io.h

[...]

> +#define BLKDBG_EVENT(child, evt) \
> +    do { \
> +        if (child) { \
> +            bdrv_debug_event(child->bs, evt); \
> +        } \
> +    } while (0)

...once here.
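
A minimal sketch of the single-definition variant, using toy stand-ins 
(ToyBDS, ToyChild, toy_debug_event() and TOY_BLKDBG_EVENT are all 
hypothetical names, not the QEMU ones). As far as I know, an identical 
macro redefinition is legal C, so the duplicate would compile silently, 
which makes it easy to miss.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the QEMU types; only what the macro touches. */
typedef struct ToyBDS { int events_seen; } ToyBDS;
typedef struct ToyChild { ToyBDS *bs; } ToyChild;

/* Stand-in for bdrv_debug_event(): just counts the events. */
void toy_debug_event(ToyBDS *bs, int evt)
{
    (void)evt;
    bs->events_seen++;
}

/* Defined exactly once, in the one header that both users include. */
#define TOY_BLKDBG_EVENT(child, evt) \
    do { \
        if (child) { \
            toy_debug_event((child)->bs, (evt)); \
        } \
    } while (0)

/* Fires one event on a possibly-NULL child, returns the running count. */
int toy_fire(ToyChild *child, ToyBDS *bs)
{
    TOY_BLKDBG_EVENT(child, 1);
    return bs->events_seen;
}
```

The do/while (0) wrapper keeps the macro usable as a single statement, 
and the NULL check means a missing child is silently skipped.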

[...]

> +/**
> + * bdrv_drained_begin:
> + *
> + * Begin a quiesced section for exclusive access to the BDS, by disabling
> + * external request sources including NBD server and device model. Note that
> + * this doesn't block timers or coroutines from submitting more requests, which
> + * means block_job_pause is still necessary.

Where does this sentence come from?  I can’t see it in master or in the 
lines removed from block.h:

> + *
> + * This function can be recursive.
> + */
> +void bdrv_drained_begin(BlockDriverState *bs);

[...]

> diff --git a/include/block/block.h b/include/block/block.h
> index e5dd22b034..1e6b8fef1e 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h

[...]

> -/**
> - * bdrv_drained_begin:
> - *
> - * Begin a quiesced section for exclusive access to the BDS, by disabling
> - * external request sources including NBD server, block jobs, and device model.
> - *
> - * This function can be recursive.
> - */
> -void bdrv_drained_begin(BlockDriverState *bs);




* Re: [PATCH v4 04/25] include/sysemu/block-backend: split header into I/O and global state (GS) API
  2021-10-25 10:17 ` [PATCH v4 04/25] include/sysemu/block-backend: split header into I/O and global state (GS) API Emanuele Giuseppe Esposito
  2021-11-12 10:23   ` Hanna Reitz
@ 2021-11-12 12:30   ` Hanna Reitz
  2021-11-16 14:24     ` Emanuele Giuseppe Esposito
  1 sibling, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-12 12:30 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> Similarly to the previous patches, split block-backend.h
> in block-backend-io.h and block-backend-global-state.h
>
> In addition, remove "block/block.h" include as it seems
> it is not necessary anymore, together with "qemu/iov.h"
>
> block-backend-common.h contains the structures shared between
> the two headers, and the functions that can't be categorized as
> I/O or global state.
>
> Assertions are added in the next patch.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block/block-backend.c                       |   9 +-
>   include/sysemu/block-backend-common.h       |  74 ++++++
>   include/sysemu/block-backend-global-state.h | 122 +++++++++
>   include/sysemu/block-backend-io.h           | 139 ++++++++++
>   include/sysemu/block-backend.h              | 269 +-------------------
>   5 files changed, 344 insertions(+), 269 deletions(-)
>   create mode 100644 include/sysemu/block-backend-common.h
>   create mode 100644 include/sysemu/block-backend-global-state.h
>   create mode 100644 include/sysemu/block-backend-io.h

[...]

> diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
> index e5e1524f06..038be9fc40 100644
> --- a/include/sysemu/block-backend.h
> +++ b/include/sysemu/block-backend.h
> @@ -13,272 +13,9 @@
>   #ifndef BLOCK_BACKEND_H
>   #define BLOCK_BACKEND_H
>   
> -#include "qemu/iov.h"
> -#include "block/throttle-groups.h"
> +#include "block-backend-global-state.h"
> +#include "block-backend-io.h"
>   
> -/*
> - * TODO Have to include block/block.h for a bunch of block layer
> - * types.  Unfortunately, this pulls in the whole BlockDriverState
> - * API, which we don't want used by many BlockBackend users.  Some of
> - * the types belong here, and the rest should be split into a common
> - * header and one for the BlockDriverState API.
> - */
> -#include "block/block.h"

This note and the include are gone.  That sounds like an improvement, but 
why is it possible?

Hanna




* Re: [PATCH v4 07/25] assertions for block_int global state API
  2021-10-25 10:17 ` [PATCH v4 07/25] assertions for block_int " Emanuele Giuseppe Esposito
@ 2021-11-12 13:51   ` Hanna Reitz
  2021-11-16 15:43     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-12 13:51 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block.c                         | 17 +++++++++++++++++
>   block/backup.c                  |  1 +
>   block/block-backend.c           |  3 +++
>   block/commit.c                  |  2 ++
>   block/dirty-bitmap.c            |  1 +
>   block/io.c                      |  6 ++++++
>   block/mirror.c                  |  4 ++++
>   block/monitor/bitmap-qmp-cmds.c |  6 ++++++
>   block/stream.c                  |  2 ++
>   blockdev.c                      |  7 +++++++
>   10 files changed, 49 insertions(+)
>
> diff --git a/block.c b/block.c
> index 672f946065..41c5883c5c 100644
> --- a/block.c
> +++ b/block.c

[...]

> @@ -7473,6 +7488,7 @@ static bool append_strong_runtime_options(QDict *d, BlockDriverState *bs)
>    * would result in exactly bs->backing. */
>   bool bdrv_backing_overridden(BlockDriverState *bs)
>   {
> +    assert(qemu_in_main_thread());
>       if (bs->backing) {
>           return strcmp(bs->auto_backing_file,
>                         bs->backing->bs->filename);

This function is in block_int-common.h, though.

[...]

> diff --git a/block/io.c b/block/io.c
> index c5d7f8495e..f271ab3684 100644
> --- a/block/io.c
> +++ b/block/io.c

[...]

> @@ -3419,6 +3423,7 @@ int coroutine_fn bdrv_co_copy_range_from(BdrvChild *src, int64_t src_offset,
>   {
>       trace_bdrv_co_copy_range_from(src, src_offset, dst, dst_offset, bytes,
>                                     read_flags, write_flags);
> +    assert(qemu_in_main_thread());
>       return bdrv_co_copy_range_internal(src, src_offset, dst, dst_offset,
>                                          bytes, read_flags, write_flags, true);
>   }

This is a block_int-io.h function.

> @@ -3435,6 +3440,7 @@ int coroutine_fn bdrv_co_copy_range_to(BdrvChild *src, int64_t src_offset,
>   {
>       trace_bdrv_co_copy_range_to(src, src_offset, dst, dst_offset, bytes,
>                                   read_flags, write_flags);
> +    assert(qemu_in_main_thread());
>       return bdrv_co_copy_range_internal(src, src_offset, dst, dst_offset,
>                                          bytes, read_flags, write_flags, false);
>   }

This, too.

Hanna




* Re: [PATCH v4 08/25] block: introduce assert_bdrv_graph_writable
  2021-10-25 10:17 ` [PATCH v4 08/25] block: introduce assert_bdrv_graph_writable Emanuele Giuseppe Esposito
@ 2021-11-12 14:40   ` Hanna Reitz
  2021-11-18  9:55     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-12 14:40 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> We want to be sure that the functions that write the child and
> parent list of a bs are under BQL and drain.
>
> BQL prevents from concurrent writings from the GS API, while
> drains protect from I/O.
>
> TODO: drains are missing in some functions using this assert.
> Therefore a proper assertion will fail. Because adding drains
> requires additional discussions, they will be added in future
> series.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block.c                                |  5 +++++
>   block/io.c                             | 11 +++++++++++
>   include/block/block_int-global-state.h | 10 +++++++++-
>   3 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/block.c b/block.c
> index 41c5883c5c..94bff5c757 100644
> --- a/block.c
> +++ b/block.c
> @@ -2734,12 +2734,14 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>           if (child->klass->detach) {
>               child->klass->detach(child);
>           }
> +        assert_bdrv_graph_writable(old_bs);
>           QLIST_REMOVE(child, next_parent);

I think this belongs above the .detach() call (and the QLIST_REMOVE() 
belongs in the .detach() implementation, as done in 
https://lists.nongnu.org/archive/html/qemu-block/2021-11/msg00240.html, 
which has been merged into Kevin’s block branch).

>       }
>   
>       child->bs = new_bs;
>   
>       if (new_bs) {
> +        assert_bdrv_graph_writable(new_bs);
>           QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);

In both these places it’s a bit strange that the assertion is done on 
the child nodes.  The subgraph starting from them isn’t modified after 
all, so their subgraph technically doesn’t need to be writable.  I think 
a single assertion on the parent node would be preferable.

I presume the problem with that is that we don’t have the parent node 
here?  Do we need a new BdrvChildClass method that performs this 
assertion on the parent node?
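
One possible shape for such a method, as a toy sketch only.  All names 
are hypothetical (ToyChildClass, parent_is_writable, ...), and taking 
"quiesce_counter > 0" as the meaning of "writable" is an assumption 
based on the TODO in this patch.  The hook returns a bool here so that 
the caller decides whether to assert on it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyChild ToyChild;

/* Per-parent-type callbacks, loosely modelled on BdrvChildClass. */
typedef struct ToyChildClass {
    /* Hypothetical hook: may the parent's edges be modified right now? */
    bool (*parent_is_writable)(ToyChild *child);
} ToyChildClass;

typedef struct ToyNode { int quiesce_counter; } ToyNode;

struct ToyChild {
    const ToyChildClass *klass;
    void *opaque;               /* the parent; its type depends on klass */
};

/* Implementation for children whose parent is a node. */
bool toy_node_parent_is_writable(ToyChild *child)
{
    ToyNode *parent = child->opaque;

    /* Assumption: "writable" means the parent is currently drained. */
    return parent->quiesce_counter > 0;
}

const ToyChildClass toy_child_of_node = {
    .parent_is_writable = toy_node_parent_is_writable,
};

/* Would be checked at the single place that rewires an edge. */
bool toy_replace_child_allowed(ToyChild *child)
{
    /* Parent types that do not implement the hook are not checked. */
    if (!child->klass->parent_is_writable) {
        return true;
    }
    return child->klass->parent_is_writable(child);
}
```

With something like this, the check lives in one place and every path 
into the edge-rewiring function gets it, without each caller having to 
carry a parent pointer.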

>   
>           /*
> @@ -2940,6 +2942,7 @@ static int bdrv_attach_child_noperm(BlockDriverState *parent_bs,
>           return ret;
>       }
>   
> +    assert_bdrv_graph_writable(parent_bs);
>       QLIST_INSERT_HEAD(&parent_bs->children, *child, next);
>       /*
>        * child is removed in bdrv_attach_child_common_abort(), so don't care to
> @@ -3140,6 +3143,7 @@ static void bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child,
>   void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child)
>   {
>       assert(qemu_in_main_thread());
> +    assert_bdrv_graph_writable(parent);

It looks to me like we have this assertion mainly because 
bdrv_replace_child_noperm() doesn’t have a pointer to this parent node.  
It’s a workaround, but we should have this in every path that eventually 
ends up at bdrv_replace_child_noperm(), and that seems rather difficult 
for the bdrv_replace_node() family of functions. That to me sounds like 
it’d be good to have this as a BdrvChildClass function.

>       if (child == NULL) {
>           return;
>       }
> @@ -4903,6 +4907,7 @@ static void bdrv_remove_filter_or_cow_child_abort(void *opaque)
>       BdrvRemoveFilterOrCowChild *s = opaque;
>       BlockDriverState *parent_bs = s->child->opaque;
>   
> +    assert_bdrv_graph_writable(parent_bs);
>       QLIST_INSERT_HEAD(&parent_bs->children, s->child, next);
>       if (s->is_backing) {
>           parent_bs->backing = s->child;
> diff --git a/block/io.c b/block/io.c
> index f271ab3684..1c71e354d6 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -740,6 +740,17 @@ void bdrv_drain_all(void)
>       bdrv_drain_all_end();
>   }
>   
> +void assert_bdrv_graph_writable(BlockDriverState *bs)
> +{
> +    /*
> +     * TODO: this function is incomplete. Because the users of this
> +     * assert lack the necessary drains, check only for BQL.
> +     * Once the necessary drains are added,
> +     * assert also for qatomic_read(&bs->quiesce_counter) > 0
> +     */
> +    assert(qemu_in_main_thread());
> +}
> +
>   /**
>    * Remove an active request from the tracked requests list
>    *
> diff --git a/include/block/block_int-global-state.h b/include/block/block_int-global-state.h
> index d08e80222c..6bd7746409 100644
> --- a/include/block/block_int-global-state.h
> +++ b/include/block/block_int-global-state.h
> @@ -316,4 +316,12 @@ void bdrv_remove_aio_context_notifier(BlockDriverState *bs,
>    */
>   void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
>   
> -#endif /* BLOCK_INT_GLOBAL_STATE*/
> +/**
> + * Make sure that the function is either running under
> + * drain and BQL. The latter protects from concurrent writings

“either ... and” sounds wrong to me.  I’d drop the “either” or say 
“running under both drain and BQL”.

Hanna

> + * from the GS API, while the former prevents concurrent reads
> + * from I/O.
> + */
> +void assert_bdrv_graph_writable(BlockDriverState *bs);
> +
> +#endif /* BLOCK_INT_GLOBAL_STATE */




* Re: [PATCH v4 10/25] assertions for blockjob_int.h
  2021-10-25 10:17 ` [PATCH v4 10/25] assertions for blockjob_int.h Emanuele Giuseppe Esposito
@ 2021-11-12 15:17   ` Hanna Reitz
  0 siblings, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-12 15:17 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   blockjob.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/blockjob.c b/blockjob.c
> index 4bad1408cb..fbd6c7d873 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -83,6 +83,7 @@ BlockJob *block_job_get(const char *id)
>   
>   void block_job_free(Job *job)
>   {
> +    assert(qemu_in_main_thread());
>       BlockJob *bjob = container_of(job, BlockJob, job);

Our coding style (docs/devel/style.rst) requires all statements to come 
after all declarations in a block, so the assert() may not precede the 
bjob declaration.
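
A sketch of the ordering the style guide asks for, with toy stand-ins 
(toy_container_of, ToyJob, ToyBlockJob and toy_in_main_thread() are 
hypothetical and much simpler than the real QEMU definitions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified container_of; QEMU has its own definition. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

typedef struct ToyJob { int id; } ToyJob;
typedef struct ToyBlockJob { int speed; ToyJob job; } ToyBlockJob;

/* Assumption: always true in this toy model. */
bool toy_in_main_thread(void)
{
    return true;
}

ToyBlockJob *toy_block_job_from_job(ToyJob *job)
{
    /* Declarations, including initialized ones, come first ... */
    ToyBlockJob *bjob = toy_container_of(job, ToyBlockJob, job);

    /* ... and the assertion, being a statement, follows them. */
    assert(toy_in_main_thread());

    return bjob;
}
```

The container_of computation is pure pointer arithmetic, so doing it 
before the assertion does not weaken the check.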

>   
>       block_job_remove_all_bdrv(bjob);
> @@ -436,6 +437,8 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
>       BlockBackend *blk;
>       BlockJob *job;
>   
> +    assert(qemu_in_main_thread());
> +
>       if (job_id == NULL && !(flags & JOB_INTERNAL)) {
>           job_id = bdrv_get_device_name(bs);
>       }
> @@ -504,6 +507,7 @@ void block_job_iostatus_reset(BlockJob *job)
>   
>   void block_job_user_resume(Job *job)
>   {
> +    assert(qemu_in_main_thread());
>       BlockJob *bjob = container_of(job, BlockJob, job);

Same here.

(And now I see that I’ve missed such instances in the other assertion 
patches, e.g. in bdrv_save_vmstate().  Those should be fixed, too.)

Hanna

>       block_job_iostatus_reset(bjob);
>   }




* Re: [PATCH v4 12/25] assertions for blockob.h global state API
  2021-10-25 10:17 ` [PATCH v4 12/25] assertions for blockob.h " Emanuele Giuseppe Esposito
@ 2021-11-12 15:26   ` Hanna Reitz
  0 siblings, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-12 15:26 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

Subject: s/blockob.h/blockjob.h/

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   blockjob.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
>
> diff --git a/blockjob.c b/blockjob.c
> index fbd6c7d873..4982f6a2b5 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -61,6 +61,7 @@ static bool is_block_job(Job *job)
>   
>   BlockJob *block_job_next(BlockJob *bjob)
>   {
> +    assert(qemu_in_main_thread());
>       Job *job = bjob ? &bjob->job : NULL;

Here and...

>       do {
> @@ -72,6 +73,7 @@ BlockJob *block_job_next(BlockJob *bjob)
>   
>   BlockJob *block_job_get(const char *id)
>   {
> +    assert(qemu_in_main_thread());
>       Job *job = job_get(id);

...here, the assertion and declaration should be switched.

Hanna




* Re: [PATCH v4 13/25] include/sysemu/blockdev.h: move drive_add and inline drive_def
  2021-10-25 10:17 ` [PATCH v4 13/25] include/sysemu/blockdev.h: move drive_add and inline drive_def Emanuele Giuseppe Esposito
@ 2021-11-12 15:41   ` Hanna Reitz
  0 siblings, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-12 15:41 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> drive_add is only used in softmmu/vl.c, so it can be a static
> function there,and drive_def is only a particular use case of
> qemu_opts_parse_noisily, so it can be inlined.
>
> Also remove drive_mark_claimed_by_board, as it is only defined
> but not implemented (nor used) anywhere.
>
> This also helps simplifying next patch.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block/monitor/block-hmp-cmds.c |  2 +-
>   blockdev.c                     | 27 +--------------------------
>   include/sysemu/blockdev.h      |  6 ++----
>   softmmu/vl.c                   | 25 ++++++++++++++++++++++++-
>   4 files changed, 28 insertions(+), 32 deletions(-)

[...]

> diff --git a/blockdev.c b/blockdev.c
> index c1f6171c6c..1bf49ef610 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -73,7 +73,7 @@ void bdrv_set_monitor_owned(BlockDriverState *bs)
>       QTAILQ_INSERT_TAIL(&monitor_bdrv_states, bs, monitor_list);
>   }
>   
> -static const char *const if_name[IF_COUNT] = {
> +const char *const if_name[IF_COUNT] = {

When making this global, I’d give its name a prefix, like 
`block_if_name` (or even `block_if_type_to_name`).
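
Roughly like this, as a sketch.  The TOY_IF_* enum is a shortened, 
hypothetical stand-in for BlockInterfaceType, and the lookup helper is 
not something the patch contains.

```c
#include <assert.h>
#include <string.h>

/* Shortened stand-in for BlockInterfaceType. */
typedef enum {
    TOY_IF_NONE,
    TOY_IF_IDE,
    TOY_IF_SCSI,
    TOY_IF_COUNT
} ToyInterfaceType;

/* No longer static, so the name carries a subsystem prefix. */
const char *const block_if_name[TOY_IF_COUNT] = {
    [TOY_IF_NONE] = "none",
    [TOY_IF_IDE]  = "ide",
    [TOY_IF_SCSI] = "scsi",
};

/* Hypothetical helper: bounds-checked lookup, NULL for unknown types. */
const char *toy_block_if_type_to_name(int type)
{
    if (type < 0 || type >= TOY_IF_COUNT) {
        return NULL;
    }
    return block_if_name[type];
}
```

A bare `if_name` in the global namespace could easily collide with 
(or be mistaken for) something network-related, hence the prefix.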

Hanna

>       [IF_NONE] = "none",
>       [IF_IDE] = "ide",
>       [IF_SCSI] = "scsi",




* Re: [PATCH v4 14/25] include/systemu/blockdev.h: global state API
  2021-10-25 10:17 ` [PATCH v4 14/25] include/systemu/blockdev.h: global state API Emanuele Giuseppe Esposito
  2021-10-28 15:48   ` Stefan Hajnoczi
@ 2021-11-12 15:46   ` Hanna Reitz
  1 sibling, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-12 15:46 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

Subject: s/systemu/sysemu/

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> blockdev functions run always under the BQL lock.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   include/sysemu/blockdev.h | 18 ++++++++++++++----
>   1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/include/sysemu/blockdev.h b/include/sysemu/blockdev.h
> index 960b54d320..b07f15df09 100644
> --- a/include/sysemu/blockdev.h
> +++ b/include/sysemu/blockdev.h
> @@ -13,9 +13,6 @@
>   #include "block/block.h"
>   #include "qemu/queue.h"
>   
> -void blockdev_mark_auto_del(BlockBackend *blk);
> -void blockdev_auto_del(BlockBackend *blk);
> -
>   typedef enum {
>       IF_DEFAULT = -1,            /* for use with drive_add() only */
>       /*
> @@ -40,6 +37,16 @@ struct DriveInfo {
>       QTAILQ_ENTRY(DriveInfo) next;
>   };
>   
> +/*
> + * Global state (GS) API. These functions run under the BQL lock.
> + *
> + * See include/block/block-global-state.h for more information about
> + * the GS API.
> + */
> +
> +void blockdev_mark_auto_del(BlockBackend *blk);
> +void blockdev_auto_del(BlockBackend *blk);
> +
>   DriveInfo *blk_legacy_dinfo(BlockBackend *blk);
>   DriveInfo *blk_set_legacy_dinfo(BlockBackend *blk, DriveInfo *dinfo);
>   BlockBackend *blk_by_legacy_dinfo(DriveInfo *dinfo);
> @@ -50,10 +57,13 @@ DriveInfo *drive_get(BlockInterfaceType type, int bus, int unit);
>   void drive_check_orphaned(void);
>   DriveInfo *drive_get_by_index(BlockInterfaceType type, int index);
>   int drive_get_max_bus(BlockInterfaceType type);
> -int drive_get_max_devs(BlockInterfaceType type);
>   DriveInfo *drive_get_next(BlockInterfaceType type);
>   
>   DriveInfo *drive_new(QemuOpts *arg, BlockInterfaceType block_default_type,
>                        Error **errp);
>   
> +/* Common functions that are neither I/O nor Global State */
> +
> +int drive_get_max_devs(BlockInterfaceType type);
> +

It seems to me like this function is never used and could just be 
dropped.  In any case, if it were used, it looks to me like it’d be used 
in a GS context.  (Not that I know anything about it, but I don’t see 
what makes it different from the other functions here.)

Hanna




* Re: [PATCH v4 19/25] block_int-common.h: split function pointers in BlockDriver
  2021-10-25 10:17 ` [PATCH v4 19/25] block_int-common.h: split function pointers in BlockDriver Emanuele Giuseppe Esposito
@ 2021-11-15 12:00   ` Hanna Reitz
  2021-11-18 12:42     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-15 12:00 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> Similar to the header split, also the function pointers in BlockDriver
> can be split in I/O and global state.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   include/block/block_int-common.h | 458 ++++++++++++++++---------------
>   1 file changed, 237 insertions(+), 221 deletions(-)
>
> diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> index 79a3d801d2..9857e775fe 100644
> --- a/include/block/block_int-common.h
> +++ b/include/block/block_int-common.h
> @@ -96,6 +96,7 @@ typedef struct BdrvTrackedRequest {
>   
>   
>   struct BlockDriver {
> +    /* Fields initialized in struct definition and never changed. */

I find this a bit difficult to understand.  Perhaps we could be more 
verbose to make it easier to read?  Like

“These fields are initialized when this object is created, and are never 
changed afterwards”

And I’d also add an empty line below to make visually clear that this 
does not describe the single subsequent field (format_name) but the 
whole chunk of fields.

I also wonder if we could substantiate the claim “never changed 
afterwards” by making these fields consts.  Then again, I suppose 
technically none of the fields in BlockDriver is supposed to be mutable 
(except for .list), neither these fields nor the function pointers...  
Yeah, maybe scratch that idea.
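
For illustration only, here is a minimal sketch of what const-qualified fields could look like. It assumes a hypothetical `ToyDriver` struct (not the real BlockDriver), with a `next` pointer standing in for the mutable `.list` linkage; all names in it are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for BlockDriver; not the real QEMU definition. */
typedef struct ToyDriver ToyDriver;
struct ToyDriver {
    /* Set once in the initializer; the compiler rejects later writes. */
    const char *const format_name;
    const int instance_size;

    /* The only mutable member, standing in for the .list linkage. */
    ToyDriver *next;
};

/* Head of the list of registered drivers. */
static ToyDriver *toy_drivers;

/* Registration only touches the mutable linkage field. */
static void toy_register(ToyDriver *drv)
{
    drv->next = toy_drivers;
    toy_drivers = drv;
}

/* Look a driver up by its (immutable) format name. */
static ToyDriver *toy_find(const char *name)
{
    for (ToyDriver *d = toy_drivers; d; d = d->next) {
        if (strcmp(d->format_name, name) == 0) {
            return d;
        }
    }
    return NULL;
}

static ToyDriver toy_qcow2 = { .format_name = "qcow2", .instance_size = 64 };
static ToyDriver toy_raw = { .format_name = "raw", .instance_size = 16 };
```

With this layout, a statement such as `toy_raw.instance_size = 0;` fails to compile, while registration still works because it only writes `next`.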

>       const char *format_name;
>       int instance_size;

[...]

> +    int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs,
> +                                          QEMUIOVector *qiov,
> +                                          int64_t pos);
> +    int coroutine_fn (*bdrv_load_vmstate)(BlockDriverState *bs,
> +                                          QEMUIOVector *qiov,
> +                                          int64_t pos);

In patch 18, you classified bdrv_co_readv_vmstate() and 
bdrv_co_writev_vmstate() as I/O functions.  They call these methods, 
though, so something doesn’t seem quite right.

Now I also remember you classified the global bdrv_save_vmstate() and 
bdrv_load_vmstate() functions as GS.  I don’t mind either way; AFAIU 
they are I/O functions that are generally called under the BQL.  But the 
classification should be consistent, of course.

> +
> +    int (*bdrv_change_backing_file)(BlockDriverState *bs,
> +        const char *backing_file, const char *backing_fmt);
> +
> +    void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
> +    void (*bdrv_lock_medium)(BlockDriverState *bs, bool locked);

Above bdrv_is_inserted(), there’s a comment (“removable device 
specific”) that I think should be duplicated here.

> +
> +    /* TODO Better pass a option string/QDict/QemuOpts to add any rule? */
> +    int (*bdrv_debug_breakpoint)(BlockDriverState *bs, const char *event,
> +        const char *tag);
> +    int (*bdrv_debug_remove_breakpoint)(BlockDriverState *bs,
> +        const char *tag);
> +    int (*bdrv_debug_resume)(BlockDriverState *bs, const char *tag);
> +    bool (*bdrv_debug_is_suspended)(BlockDriverState *bs, const char *tag);
> +    void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);

Originally, there was an empty line before bdrv_refresh_limits(), and 
I’d keep it.

[...]

> +    /**
> +     * Try to get @bs's logical and physical block size.
> +     * On success, store them in @bsz and return zero.
> +     * On failure, return negative errno.
> +     */
> +    int (*bdrv_probe_blocksizes)(BlockDriverState *bs, BlockSizes *bsz);

Originally, there was a comment above this function (“I/O API, even 
though if it's a filter jumps on parent”) that you added in patch 6 and 
that I didn’t understand.  Given this, I’m not unhappy to see it go 
again, but now I wonder about its purpose even more...

> +    /**
> +     * Try to get @bs's geometry (cyls, heads, sectors)
> +     * On success, store them in @geo and return 0.
> +     * On failure return -errno.
> +     * Only drivers that want to override guest geometry implement this
> +     * callback; see hd_geometry_guess().
> +     */
> +    int (*bdrv_probe_geometry)(BlockDriverState *bs, HDGeometry *geo);

(Here, too)

[...]

> +    /**
> +     * Register/unregister a buffer for I/O. For example, when the driver is
> +     * interested to know the memory areas that will later be used in iovs, so
> +     * that it can do IOMMU mapping with VFIO etc., in order to get better
> +     * performance. In the case of VFIO drivers, this callback is used to do
> +     * DMA mapping for hot buffers.
> +     */
> +    void (*bdrv_register_buf)(BlockDriverState *bs, void *host, size_t size);
> +    void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host);
> +    QLIST_ENTRY(BlockDriver) list;

This field is interesting because it’s the only mutable field in the 
whole struct.  I think it should be separated from the function pointers 
above and given a comment like “This field is modified only under the 
BQL, and is part of the global state”.

> +
> +    /*
> +     * I/O API functions. These functions are thread-safe.
> +     *
> +     * See include/block/block-io.h for more information about
> +     * the I/O API.
> +     */
> +
> +    int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
> +                                       Error **errp);
> +    int coroutine_fn (*bdrv_co_create_opts)(BlockDriver *drv,
> +                                            const char *filename,
> +                                            QemuOpts *opts,
> +                                            Error **errp);

Now this is really interesting.  Technically I suppose these should work 
in any thread, but trying to do so results in:

$ touch /tmp/iothread-create-test.qcow2
$ ./qemu-system-x86_64 -object iothread,id=iothr0 -qmp stdio <<EOF
{"execute": "qmp_capabilities"}
{"execute":"blockdev-add","arguments":{"node-name":"proto","driver":"file","filename":"/tmp/iothread-create-test.qcow2"}}
{"execute":"x-blockdev-set-iothread","arguments":{"node-name":"proto","iothread":"iothr0"}}
{"execute":"blockdev-create","arguments":{"job-id":"create","options":{"driver":"qcow2","file":"proto","size":0}}}
EOF
{"QMP": {"version": {"qemu": {"micro": 90, "minor": 1, "major": 6}, 
"package": "v6.2.0-rc0-40-gd02d5fe5fb-dirty"}, "capabilities": ["oob"]}}
{"return": {}}
{"return": {}}
{"return": {}}
{"timestamp": {"seconds": 1636973542, "microseconds": 338117}, "event": 
"JOB_STATUS_CHANGE", "data": {"status": "created", "id": "create"}}
{"timestamp": {"seconds": 1636973542, "microseconds": 338197}, "event": 
"JOB_STATUS_CHANGE", "data": {"status": "running", "id": "create"}}
{"return": {}}
qemu: qemu_mutex_unlock_impl: Operation not permitted
[1]    86154 IOT instruction (core dumped)  ./qemu-system-x86_64 -object 
iothread,id=iothr0 -qmp stdio <<<''

So something’s fishy and perhaps we should investigate this...  I mean, 
I can’t really imagine a case where someone would need to run a 
blockdev-create job in an I/O thread, but right now the interface allows 
for it.

And then bdrv_create() is classified as global state, and also 
bdrv_co_create_opts_simple(), which is supposed to be a drop-in function 
for this .bdrv_co_create_opts function.  So that can’t work.

Also, I believe there might have been some functions you classified as 
GS that are called from .create* implementations.  I accepted that, 
given the abort I sketched above.  However, if we classify image 
creation as I/O, then those would need to be re-evaluated. For example, 
qcow2_co_create_opts() calls bdrv_create_file(), which is a GS function.

Some of these issues could be addressed by making .bdrv_co_create_opts a 
GS function and .bdrv_co_create an I/O function.  I believe that would 
be the ideal split, even though as shown above .bdrv_co_create doesn’t 
work in an I/O thread, and then you have the issue of probably all 
format drivers’ .bdrv_co_create implementations calling 
bdrv_open_blockdev_ref(), which is a GS function.

(VMDK even calls blk_new_open(), blk_new_with_bs(), and blk_unref(), 
none of which can ever be I/O functions, I think.)

I believe the best approach in practice is to classify all create-related 
functions as GS functions for now.  This is supported by the fact that 
qmp_blockdev_create() specifically creates the create job in the main 
context (with a TODO comment) and requires block drivers to error out 
when they encounter a node in a different AioContext.
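
The "error out on a node in a different AioContext" rule can be sketched roughly as follows. This is a toy model under stated assumptions: `ToyAioContext`, `ToyNode` and `toy_create_on()` are hypothetical names, and the real check in QEMU's block drivers looks different.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real AioContext and BlockDriverState differ. */
typedef struct ToyAioContext { int id; } ToyAioContext;
typedef struct ToyNode { ToyAioContext *ctx; } ToyNode;

static ToyAioContext toy_main_ctx = { 0 };
static ToyAioContext toy_iothread_ctx = { 1 };

/*
 * A create implementation treated as GS code: refuse nodes that live
 * outside the main context instead of touching them from the wrong thread.
 */
static int toy_create_on(const ToyNode *proto)
{
    if (proto->ctx != &toy_main_ctx) {
        return -EINVAL;
    }
    /* ... image creation would happen here ... */
    return 0;
}
```

The point of the sketch is only that the failure is reported as an error to the caller rather than ending in an abort like the one shown above.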

> +    int coroutine_fn (*bdrv_co_amend)(BlockDriverState *bs,
> +                                      BlockdevAmendOptions *opts,
> +                                      bool force,
> +                                      Error **errp);
> +
>       /* aio */
>       BlockAIOCB *(*bdrv_aio_preadv)(BlockDriverState *bs,
>           int64_t offset, int64_t bytes, QEMUIOVector *qiov,

[...]

> @@ -443,47 +632,20 @@ struct BlockDriver {

Right above here (i.e. line 631), there’s an attribute 
`has_variable_length`.  I believe that should go up to the immutable 
fields section.

>       int64_t (*bdrv_get_allocated_file_size)(BlockDriverState *bs);
>       BlockMeasureInfo *(*bdrv_measure)(QemuOpts *opts, BlockDriverState *in_bs,
>                                         Error **errp);
> -

I’d keep this empty line.

Hanna

>       int coroutine_fn (*bdrv_co_pwritev_compressed)(BlockDriverState *bs,
>           int64_t offset, int64_t bytes, QEMUIOVector *qiov);
>       int coroutine_fn (*bdrv_co_pwritev_compressed_part)(BlockDriverState *bs,
>           int64_t offset, int64_t bytes, QEMUIOVector *qiov,
>           size_t qiov_offset);




* Re: [PATCH v4 02/25] include/block/block: split header into I/O and global state API
  2021-11-11 15:00   ` Hanna Reitz
@ 2021-11-15 12:08     ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-15 12:08 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

>> +/*
>> + * I/O API functions. These functions are thread-safe, and therefore
>> + * can run in any thread as long as the thread has called
>> + * aio_context_acquire/release().
>> + */
>> +
>> +int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
>> +                          Error **errp);
> 
> Why is this function here?  Naïvely, I would’ve assumed as a 
> graph-modifying function it should be in block-global-state.h.
> 
> I mean, perhaps it’s thread-safe and then it can fit here, too. Still, 
> it surprises me a bit to find this here.
> 
> Hanna
> 

Agreed. I also tested this, and it can go in global state. Will fix that.

Thank you,
Emanuele




* Re: [PATCH v4 03/25] assertions for block global state API
  2021-11-11 16:32   ` Hanna Reitz
@ 2021-11-15 12:27     ` Emanuele Giuseppe Esposito
  2021-11-15 15:27       ` Hanna Reitz
  0 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-15 12:27 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

> 
>> @@ -5958,6 +6043,7 @@ const char *bdrv_get_parent_name(const 
>> BlockDriverState *bs)
>>   /* TODO check what callers really want: bs->node_name or blk_name() */
>>   const char *bdrv_get_device_name(const BlockDriverState *bs)
>>   {
>> +    assert(qemu_in_main_thread());
>>       return bdrv_get_parent_name(bs) ?: "";
>>   }
> 
> This function is invoked from qcow2_signal_corruption(), which comes 
> generally from an I/O path.  Is it safe to assert that we’re in the main 
> thread here?
> 
> Well, the question is probably rather whether this needs really be a 
> considered a global-state function, or whether putting it in common or 
> I/O is fine.  I believe you’re right given that it invokes 
> bdrv_get_parent_name(), it cannot be thread-safe, but then we’ll have to 
> change qcow2_signal_corruption() so it doesn’t invoke this function.
> 

I think that the assertion is wrong here. If even a single caller does 
not run under the BQL, we cannot keep the function in the global-state headers.
It should go into block-io.h, together with bdrv_get_parent_name and 
bdrv_get_device_or_node_name (caller of bdrv_get_parent_name).

Since bdrv_get_parent_name only scans through the list of bs->parents, 
as long as we can assert that the parent list is modified only under BQL 
+ drain, it is safe to be read and can be I/O.
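
The argument above can be sketched as a toy model. It assumes two hypothetical flags, `toy_in_main_thread` and `toy_drained_count`, standing in for "holds the BQL" and "node is drained"; the list layout is invented for the example and is not the real bs->parents structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: flags stand in for "holds the BQL" and "drained". */
static bool toy_in_main_thread = true;
static int toy_drained_count;

typedef struct ToyParent {
    const char *name;           /* NULL if this parent has no name */
    struct ToyParent *next;
} ToyParent;

typedef struct ToyNode {
    ToyParent *parents;
} ToyNode;

/* Writer side: only legal under the BQL while the node is drained. */
static void toy_add_parent(ToyNode *bs, ToyParent *p)
{
    assert(toy_in_main_thread);
    assert(toy_drained_count > 0);
    p->next = bs->parents;
    bs->parents = p;
}

/* Reader side: a plain scan that carries no main-thread assertion. */
static const char *toy_get_parent_name(const ToyNode *bs)
{
    for (const ToyParent *p = bs->parents; p; p = p->next) {
        if (p->name) {
            return p->name;
        }
    }
    return NULL;
}
```

The sketch only encodes the invariant; it does not prove that every writer in the real code actually respects it, which is the assumption the classification as I/O would rest on.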

> [...]
> 
>> diff --git a/block/io.c b/block/io.c
>> index bb0a254def..c5d7f8495e 100644
>> --- a/block/io.c
>> +++ b/block/io.c
> 
> [...]
> 
>> @@ -544,6 +546,7 @@ void bdrv_drained_end(BlockDriverState *bs)
>>   void bdrv_drained_end_no_poll(BlockDriverState *bs, int 
>> *drained_end_counter)
>>   {
>> +    assert(qemu_in_main_thread());
>>       bdrv_do_drained_end(bs, false, NULL, false, drained_end_counter);
>>   }
> 
> Why is bdrv_drained_end an I/O function and this is a GS function, even 
> though it does just a subset?

Right, this is clearly called in bdrv_drained_end. I don't know how it is 
possible that no test managed to trigger this assertion... This is 
actually invoked in both unit and qemu-iothread tests...

> 
>> @@ -586,12 +589,14 @@ void bdrv_unapply_subtree_drain(BdrvChild 
>> *child, BlockDriverState *old_parent)
>>   void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
>>   {
>>       assert(qemu_in_coroutine());
>> +    assert(qemu_in_main_thread());
>>       bdrv_drained_begin(bs);
>>       bdrv_drained_end(bs);
>>   }
>>   void bdrv_drain(BlockDriverState *bs)
>>   {
>> +    assert(qemu_in_main_thread());
>>       bdrv_drained_begin(bs);
>>       bdrv_drained_end(bs);
>>   }
> 
> Why are these GS functions when both bdrv_drained_begin() and 
> bdrv_drained_end() are I/O functions?
> 
> I can understand making the drain_all functions GS functions, but it 
> seems weird to say it’s an I/O function when a single BDS is drained via 
> bdrv_drained_begin() and bdrv_drained_end(), but not via bdrv_drain(), 
> which just does both.
> 
> (I can see that there are no I/O path callers, but I still find it 
> strange.)

The problem, as you saw, is with the callers: bdrv_drain seems to be 
called only in the main thread, while other, similar drains also come 
from I/O. I would say it's a better idea to just classify everything as 
I/O; there are probably cases not caught by the tests.

The only ones in global-state will be:
- bdrv_apply_subtree_drain and bdrv_unapply_subtree_drain (called only 
in .attach and .detach callbacks of BdrvChildClass, run under BQL).
- bdrv_drain_all_end_quiesce (called only by bdrv_close thus under BQL).
- bdrv_drain_all_begin, bdrv_drain_all_end and bdrv_drain_all (have 
already BQL assertions)

In general, this is the underlying problem with these assertions: if a 
test doesn't test a specific code path, an unneeded assertion might pass 
undetected.
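
A small sketch of this coverage problem, under the assumption that a counter replaces the abort so the effect can be observed. `toy_assert_main_thread()` and the other `toy_*` names are hypothetical and only model `assert(qemu_in_main_thread())`.

```c
#include <assert.h>
#include <stdbool.h>

static bool toy_in_main_thread;   /* hypothetical: the caller's context */
static int toy_violations;        /* counts what assert() would abort on */

/* Stand-in for assert(qemu_in_main_thread()); records instead of aborting. */
static void toy_assert_main_thread(void)
{
    if (!toy_in_main_thread) {
        toy_violations++;
    }
}

/* Helper wrongly marked as global state. */
static void toy_drained_end_no_poll(void)
{
    toy_assert_main_thread();
}

/* Called from the main loop: the assertion holds. */
static void toy_main_loop_path(void)
{
    toy_in_main_thread = true;
    toy_drained_end_no_poll();
}

/* Called from an iothread: the same helper is reached without the BQL. */
static void toy_iothread_path(void)
{
    toy_in_main_thread = false;
    toy_drained_end_no_poll();
}
```

A test suite that only ever runs `toy_main_loop_path()` sees zero violations, so the misplaced assertion stays invisible until something exercises `toy_iothread_path()`.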

> 
> [...]
> 
>> @@ -2731,6 +2742,7 @@ int bdrv_block_status_above(BlockDriverState 
>> *bs, BlockDriverState *base,
>>   int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t 
>> bytes,
>>                         int64_t *pnum, int64_t *map, BlockDriverState 
>> **file)
>>   {
>> +    assert(qemu_in_main_thread());
>>       return bdrv_block_status_above(bs, bdrv_filter_or_cow_bs(bs),
>>                                      offset, bytes, pnum, map, file);
>>   }
> 
> Why is this a GS function as opposed to all other block-status 
> functions?  Because of the bdrv_filter_or_cow_bs() call?

This function is in block.io but has the assertion. I think it's a 
proper I/O function, but I forgot to remove the assertion during my 
various rounds of trial and error.

Let me know if you disagree with anything I said.

Thank you,
Emanuele




* Re: [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers
  2021-10-25 10:17 ` [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers Emanuele Giuseppe Esposito
@ 2021-11-15 12:48   ` Hanna Reitz
  2021-11-15 14:15     ` Hanna Reitz
  2021-11-17 11:33     ` Emanuele Giuseppe Esposito
  0 siblings, 2 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-15 12:48 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block.c | 17 +++++++++++++++++
>   1 file changed, 17 insertions(+)
>
> diff --git a/block.c b/block.c
> index 94bff5c757..40c4729b8d 100644
> --- a/block.c
> +++ b/block.c

[...]

> @@ -2148,6 +2152,7 @@ static void bdrv_child_perm(BlockDriverState *bs, BlockDriverState *child_bs,
>                               uint64_t *nperm, uint64_t *nshared)
>   {
>       assert(bs->drv && bs->drv->bdrv_child_perm);
> +    assert(qemu_in_main_thread());
>       bs->drv->bdrv_child_perm(bs, c, role, reopen_queue,
>                                parent_perm, parent_shared,
>                                nperm, nshared);

(Should’ve noticed earlier, but only did now...)

First, this function is indirectly called by bdrv_refresh_perms(). I 
understand that all perm-related functions are classified as GS.

However, bdrv_co_invalidate_cache() invokes bdrv_refresh_perms. Being 
declared in block/coroutine.h, it’s an I/O function, so it mustn’t call 
such a GS function. BlockDriver.bdrv_co_invalidate_cache(), 
bdrv_invalidate_cache(), and blk_invalidate_cache() are also classified 
as I/O functions. Perhaps all of these functions should be classified as 
GS functions?  I believe their callers and their purpose would allow for 
this.


Second, it’s called by bdrv_child_refresh_perms(), which is called by 
block_crypto_amend_options_generic_luks().  This function is called by 
block_crypto_co_amend_luks(), which is a BlockDriver.bdrv_co_amend 
implementation, which is classified as an I/O function.
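
The two inconsistencies described above amount to "an I/O function calls a GS function". A minimal sketch of such a consistency check over a hand-written call table follows. The table is an illustrative subset taken from this mail, the class of the static helper block_crypto_amend_options_generic_luks() is inferred from its caller, and the `Toy*` types are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOY_GS, TOY_IO } ToyClass;
typedef struct { const char *name; ToyClass cls; } ToyFunc;
typedef struct { const char *caller; const char *callee; } ToyCall;

#define TOY_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Classification as described in the review (illustrative subset). */
static const ToyFunc toy_funcs[] = {
    { "bdrv_co_invalidate_cache",                TOY_IO },
    { "bdrv_refresh_perms",                      TOY_GS },
    { "bdrv_child_refresh_perms",                TOY_GS },
    { "block_crypto_co_amend_luks",              TOY_IO },
    /* Inferred: static helper of an I/O callback. */
    { "block_crypto_amend_options_generic_luks", TOY_IO },
};

/* Direct call edges mentioned in the review. */
static const ToyCall toy_calls[] = {
    { "bdrv_co_invalidate_cache", "bdrv_refresh_perms" },
    { "block_crypto_co_amend_luks",
      "block_crypto_amend_options_generic_luks" },
    { "block_crypto_amend_options_generic_luks",
      "bdrv_child_refresh_perms" },
};

static const ToyFunc *toy_lookup(const char *name)
{
    for (size_t i = 0; i < TOY_ARRAY_SIZE(toy_funcs); i++) {
        if (strcmp(toy_funcs[i].name, name) == 0) {
            return &toy_funcs[i];
        }
    }
    return NULL;
}

/* Count edges where an I/O function calls a GS function. */
static int toy_count_violations(void)
{
    int n = 0;

    for (size_t i = 0; i < TOY_ARRAY_SIZE(toy_calls); i++) {
        const ToyFunc *from = toy_lookup(toy_calls[i].caller);
        const ToyFunc *to = toy_lookup(toy_calls[i].callee);

        if (from && to && from->cls == TOY_IO && to->cls == TOY_GS) {
            n++;
        }
    }
    return n;
}
```

On this table the check flags two edges, one for each of the two cases listed above.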

Honestly, I don’t know how to fix that mess.  The best would be if we 
could make the perm functions thread-safe and classify them as I/O, but 
it seems to me like that’s impossible (I sure hope I’m wrong).  On the 
other hand, .bdrv_co_amend very much strikes me as a GS function, but 
it isn’t.  I’m afraid it must work on nodes that are not in the main 
context, and it launches a job, so AFAIU we absolutely cannot run it 
under the BQL.

It almost seems to me like we’d need a thread-safe variant of the perm 
functions that’s allowed to fail when it cannot guarantee thread safety 
or something.  Or perhaps I’m wrong and the perm functions can actually 
be classified as thread-safe and I/O, that’d be great…

Hanna




* Re: [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers
  2021-11-15 12:48   ` Hanna Reitz
@ 2021-11-15 14:15     ` Hanna Reitz
  2021-11-17 11:33     ` Emanuele Giuseppe Esposito
  1 sibling, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-15 14:15 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 15.11.21 13:48, Hanna Reitz wrote:
> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   block.c | 17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index 94bff5c757..40c4729b8d 100644
>> --- a/block.c
>> +++ b/block.c
>
> [...]
>
>> @@ -2148,6 +2152,7 @@ static void bdrv_child_perm(BlockDriverState 
>> *bs, BlockDriverState *child_bs,
>>                               uint64_t *nperm, uint64_t *nshared)
>>   {
>>       assert(bs->drv && bs->drv->bdrv_child_perm);
>> +    assert(qemu_in_main_thread());
>>       bs->drv->bdrv_child_perm(bs, c, role, reopen_queue,
>>                                parent_perm, parent_shared,
>>                                nperm, nshared);
>
> (Should’ve noticed earlier, but only did now...)
>
> First, this function is indirectly called by bdrv_refresh_perms(). I 
> understand that all perm-related functions are classified as GS.
>
> However, bdrv_co_invalidate_cache() invokes bdrv_refresh_perms. Being 
> declared in block/coroutine.h, it’s an I/O function, so it mustn’t 
> call such a GS function. BlockDriver.bdrv_co_invalidate_cache(), 
> bdrv_invalidate_cache(), and blk_invalidate_cache() are also 
> classified as I/O functions. Perhaps all of these functions should be 
> classified as GS functions?  I believe their callers and their purpose 
> would allow for this.
>
>
> Second, it’s called by bdrv_child_refresh_perms(), which is called by 
> block_crypto_amend_options_generic_luks().  This function is called by 
> block_crypto_co_amend_luks(), which is a BlockDriver.bdrv_co_amend 
> implementation, which is classified as an I/O function.
>
> Honestly, I don’t know how to fix that mess.  The best would be if we 
> could make the perm functions thread-safe and classify them as I/O, 
> but it seems to me like that’s impossible (I sure hope I’m wrong).  On 
> the other hand, .bdrv_co_amend very much strikes me like a GS 
> function, but it isn’t.  I’m afraid it must work on nodes that are not 
> in the main context, and it launches a job, so AFAIU we absolutely 
> cannot run it under the BQL.
>
> It almost seems to me like we’d need a thread-safe variant of the perm 
> functions that’s allowed to fail when it cannot guarantee thread 
> safety or something.  Or perhaps I’m wrong and the perm functions can 
> actually be classified as thread-safe and I/O, that’d be great…

Hm.  Can we perhaps let block_crypto_amend_options_generic_luks() take 
the BQL just for the permission adjustment (i.e. the 
bdrv_child_refresh_perms() call)?  Would that work? :/
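
As a rough sketch of that idea, assuming a boolean flag as a hypothetical model of the BQL (the real lock, and whether it may be taken from this code path at all, is exactly the open question), the shape would be:

```c
#include <assert.h>
#include <stdbool.h>

static bool toy_bql_held;        /* hypothetical model of the BQL */
static int toy_perm_refreshes;   /* how often the GS function ran */

static void toy_bql_lock(void)
{
    assert(!toy_bql_held);
    toy_bql_held = true;
}

static void toy_bql_unlock(void)
{
    assert(toy_bql_held);
    toy_bql_held = false;
}

/* GS function: must only run with the BQL held. */
static int toy_child_refresh_perms(void)
{
    assert(toy_bql_held);
    toy_perm_refreshes++;
    return 0;
}

/* I/O-side amend helper: takes the lock only around the GS call. */
static int toy_amend_luks(void)
{
    int ret;

    /* ... key slot updates would run here without the BQL ... */
    toy_bql_lock();
    ret = toy_child_refresh_perms();
    toy_bql_unlock();
    /* ... remaining I/O continues without the BQL ... */
    return ret;
}
```

The model says nothing about lock ordering against the AioContext lock or about blocking inside a coroutine, so it does not answer whether this would work in QEMU.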

Hanna




* Re: [PATCH v4 21/25] block_int-common.h: split function pointers in BdrvChildClass
  2021-10-25 10:17 ` [PATCH v4 21/25] block_int-common.h: split function pointers in BdrvChildClass Emanuele Giuseppe Esposito
@ 2021-11-15 14:36   ` Hanna Reitz
  0 siblings, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-15 14:36 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   include/block/block_int-common.h | 51 ++++++++++++++++++++------------
>   1 file changed, 32 insertions(+), 19 deletions(-)
>
> diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> index 9857e775fe..ea16099c53 100644
> --- a/include/block/block_int-common.h
> +++ b/include/block/block_int-common.h

[...]

> +    /*
> +     * I/O API functions. These functions are thread-safe.
> +     *
> +     * See include/block/block-io.h for more information about
> +     * the I/O API.
> +     */
> +
> +    void (*resize)(BdrvChild *child);
> +
>       /*
>        * If this pair of functions is implemented, the parent doesn't issue new
>        * requests after returning from .drained_begin() until .drained_end() is
> @@ -859,23 +889,6 @@ struct BdrvChildClass {
>        */
>       void (*activate)(BdrvChild *child, Error **errp);
>       int (*inactivate)(BdrvChild *child);

These are now I/O functions, but I believe they should be GS functions: 
BlockBackend’s implementation for .activate (blk_root_activate()) 
invokes blk_set_perm(), which is a GS function.  Its .inactivate 
implementation (blk_root_inactivate()) invokes bdrv_child_try_set_perm().

(I also believe this makes sense in context: .activate is called by 
bdrv_co_invalidate_cache(), which should be a GS function as I’ve 
suggested in my reply to patch 20.  .inactivate is called by 
bdrv_inactivate_recurse(), which is a GS function, too.)

Hanna




* Re: [PATCH v4 22/25] block_int-common.h: assertions in the callers of BdrvChildClass function pointers
  2021-10-25 10:17 ` [PATCH v4 22/25] block_int-common.h: assertions in the callers of BdrvChildClass function pointers Emanuele Giuseppe Esposito
@ 2021-11-15 14:48   ` Hanna Reitz
  0 siblings, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-15 14:48 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
>
> diff --git a/block.c b/block.c
> index 40c4729b8d..da80e89ad4 100644
> --- a/block.c
> +++ b/block.c

[...]

> @@ -7173,6 +7182,7 @@ static bool bdrv_parent_can_set_aio_context(BdrvChild *c, AioContext *ctx,
>           return true;
>       }
>       *ignore = g_slist_prepend(*ignore, c);
> +    assert(qemu_in_main_thread());

It definitely isn’t wrong to place the assert() here, of course, but 
it’s an interesting place nonetheless.  In other places it seemed like 
you’d prefer to place it above the first non-declaration statement.  
Absolutely no need to change it if you don’t want to, just something 
that I noticed.

Hanna




* Re: [PATCH v4 24/25] job.h: split function pointers in JobDriver
  2021-10-25 10:17 ` [PATCH v4 24/25] job.h: split function pointers in JobDriver Emanuele Giuseppe Esposito
@ 2021-11-15 15:11   ` Hanna Reitz
  2021-11-17 13:43     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-15 15:11 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> The job API will be handled separately in another serie.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   include/qemu/job.h | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
>
> diff --git a/include/qemu/job.h b/include/qemu/job.h
> index 6e67b6977f..7e9e59f4b8 100644
> --- a/include/qemu/job.h
> +++ b/include/qemu/job.h
> @@ -169,12 +169,21 @@ typedef struct Job {
>    * Callbacks and other information about a Job driver.
>    */
>   struct JobDriver {
> +
> +    /* Fields initialized in struct definition and never changed. */

Like in patch 19, I’d prefer a slightly more verbose comment that I’d 
find more easily readable.

> +
>       /** Derived Job struct size */
>       size_t instance_size;
>   
>       /** Enum describing the operation */
>       JobType job_type;
>   
> +    /*
> +     * Functions run without regard to the BQL and may run in any

s/and/that/?

> +     * arbitrary thread. These functions do not need to be thread-safe
> +     * because the caller ensures that are invoked from one thread at time.

s/that/they/ (or “that they”)

I believe .run() must be run in the job’s context, though.  Not sure if 
that’s necessary to note, but it isn’t really an arbitrary thread, and 
block jobs certainly require this (because they run in the block 
device’s context).  Or is that something that’s going to change with I/O 
threading?

Hanna




* Re: [PATCH v4 03/25] assertions for block global state API
  2021-11-15 12:27     ` Emanuele Giuseppe Esposito
@ 2021-11-15 15:27       ` Hanna Reitz
  0 siblings, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-15 15:27 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 15.11.21 13:27, Emanuele Giuseppe Esposito wrote:
>>
>>> @@ -586,12 +589,14 @@ void bdrv_unapply_subtree_drain(BdrvChild 
>>> *child, BlockDriverState *old_parent)
>>>   void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
>>>   {
>>>       assert(qemu_in_coroutine());
>>> +    assert(qemu_in_main_thread());
>>>       bdrv_drained_begin(bs);
>>>       bdrv_drained_end(bs);
>>>   }
>>>   void bdrv_drain(BlockDriverState *bs)
>>>   {
>>> +    assert(qemu_in_main_thread());
>>>       bdrv_drained_begin(bs);
>>>       bdrv_drained_end(bs);
>>>   }
>>
>> Why are these GS functions when both bdrv_drained_begin() and 
>> bdrv_drained_end() are I/O functions?
>>
>> I can understand making the drain_all functions GS functions, but it 
>> seems weird to say it’s an I/O function when a single BDS is drained 
>> via bdrv_drained_begin() and bdrv_drained_end(), but not via 
>> bdrv_drain(), which just does both.
>>
>> (I can see that there are no I/O path callers, but I still find it 
>> strange.)
>
> The problem as you saw is on the path callers: bdrv_drain seems to be 
> called only in main thread, while others or similar drains are also 
> coming from I/O. I would say maybe it's a better idea to just put 
> everything I/O, maybe (probably) there are cases not caught by the tests.
>
> The only ones in global-state will be:
> - bdrv_apply_subtree_drain and bdrv_unapply_subtree_drain (called only 
> in .attach and .detach callbacks of BdrvChildClass, run under BQL).
> - bdrv_drain_all_end_quiesce (called only by bdrv_close thus under BQL).
> - bdrv_drain_all_begin, bdrv_drain_all_end and bdrv_drain_all (have 
> already BQL assertions)
>
> In general, this is the underlying problem with these assertions: if a 
> test doesn't test a specific code path, an unneeded assertion might 
> pass undetected.

I believe the I/O vs. GS classification should not only be done based on 
functional concerns, i.e. who calls this function (is it called by an 
I/O function?) and whether it can be thread-safe or not (does it call a 
BQL function?), but also needs to be done semantically.  I believe there 
are some functions that we consider thread-safe but are also only called 
by BQL functions, for example apparently bdrv_drain(), 
bdrv_apply_subtree_drain(), and bdrv_unapply_subtree_drain().  Such 
functions can functionally be considered both GS and I/O functions, so 
the decision should be made semantically.  I believe that it makes sense 
to say all drain-related functions for a single subtree are I/O functions.

OTOH, functions operating on multiple trees in the block graph (i.e. all 
functions that have some “_all_” in their name) should naturally be 
considered GS functions, regardless of whether their implementation is 
thread-safe or not, because they are by nature global functions.

That’s why I’m wondering why some drain functions are I/O and others are 
GS; or why this block-status function is I/O and this one is GS.  It may 
make sense functionally, but semantically it’s strange.

I understand it may be difficult for you to know which functions relate 
to each other and thus know these semantic relationships, but in most of 
the series the split seems very reasonable to me, semantically.  If it 
didn’t, I said so. :)

Hanna




* Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O
  2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
                   ` (26 preceding siblings ...)
  2021-10-28 15:49 ` Stefan Hajnoczi
@ 2021-11-15 16:03 ` Hanna Reitz
  2021-11-15 16:11   ` Daniel P. Berrangé
                     ` (2 more replies)
  27 siblings, 3 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-15 16:03 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> Currently, block layer APIs like block-backend.h contain a mix of
> functions that are either running in the main loop and under the
> BQL, or are thread-safe functions and run in iothreads performing I/O.
> The functions running under BQL also take care of modifying the
> block graph, by using drain and/or aio_context_acquire/release.
> This makes it very confusing to understand where each function
> runs, and what assumptions it provided with regards to thread
> safety.
>
> We call the functions running under BQL "global state (GS) API", and
> distinguish them from the thread-safe "I/O API".
>
> The aim of this series is to split the relevant block headers in
> global state and I/O sub-headers.

Despite leaving quite some comments, the series and the split seem 
reasonable to me overall.  (This is a pretty big series, after all, so 
those “some comments” stack up against a majority of changes that seem 
OK to me. :))

One thing I noticed while reviewing is that it’s really hard to verify 
that no I/O function calls a GS function.  What would be wonderful is 
some function marker like coroutine_fn that marks GS functions (or I/O 
functions) and that we could then verify the call paths.  But AFAIU 
we’ve always wanted precisely that for coroutine_fn and still don’t have 
it, so this seems like extremely wishful thinking... :(
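
To make the idea of such a marker a bit more concrete, here is a minimal sketch. All names in it (GS_FN, IO_FN, io_enter(), io_leave(), in_io_path() and the two example functions) are hypothetical and do not exist in QEMU. The macros expand to nothing and only document intent, much like coroutine_fn; the per-thread counter adds a run-time check that notices when a GS function is reached from inside an I/O function. It only catches paths that are actually executed, so it is no substitute for the static call-path verification wished for above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical markers: they expand to nothing and only document intent. */
#define GS_FN
#define IO_FN

/* Per-thread nesting depth of I/O functions currently on the call stack. */
static _Thread_local int io_depth;

void io_enter(void)
{
    io_depth++;
}

void io_leave(void)
{
    assert(io_depth > 0);
    io_depth--;
}

/* True if the calling thread is currently inside an I/O function. */
bool in_io_path(void)
{
    return io_depth > 0;
}

/*
 * Example GS function. It reports a violation by returning false so the
 * check can be demonstrated; real code would more likely assert instead.
 */
bool GS_FN gs_update_permissions(void)
{
    if (in_io_path()) {
        return false;   /* reached from an I/O path: not allowed */
    }
    return true;
}

/* Example I/O function that optionally (and wrongly) calls into GS code. */
bool IO_FN io_do_request(bool call_gs)
{
    bool ok = true;

    io_enter();
    if (call_gs) {
        ok = gs_update_permissions();
    }
    io_leave();
    return ok;
}
```

Calling gs_update_permissions() directly succeeds, while io_do_request(true) reports the violation because the GS function sees a non-zero I/O depth.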

I think most of the issues I found can be fixed (or are even 
irrelevant), the only thing that really worries me are the two places 
that are clearly I/O paths that call permission functions: Namely first 
block_crypto_amend_options_generic_luks() (part of the luks block 
driver’s .bdrv_co_amend implementation), which calls 
bdrv_child_refresh_perms(); and second fuse_do_truncate(), which calls 
blk_set_perm().

In the first case, we need this call so that we don’t permanently hog 
the WRITE permission for the luks file, which used to be a problem, I 
believe.  We want to unshare the WRITE permission (and apparently also 
CONSISTENT_READ) during the key update, so we need some way to 
temporarily update the permissions.

I only really see four solutions for this:
(1) We somehow make the amend job run in the main context under the BQL 
and have it prevent all concurrent I/O access (seems bad)
(2) We can make the permission functions part of the I/O path (seems 
wrong and probably impossible?)
(3) We can drop the permissions update and permanently require the 
permissions that we need when updating keys (I think this might break 
existing use cases)
(4) We can acquire the BQL around the permission update call and perhaps 
that works?

I don’t know how (4) would work but it’s basically the only reasonable 
solution I can come up with.  Would this be a way to call a BQL function 
from an I/O function?
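
Purely as an illustration of what (4) might look like, here is a minimal sketch. It assumes a plain pthread mutex standing in for the BQL; bql, bql_held, run_under_bql(), refresh_perms_stub() and nested_stub() are hypothetical names, not QEMU APIs. It also ignores the hard part of the question, namely whether an I/O thread may safely block on the BQL at that point at all (for instance while the main loop is waiting for that very thread in a drain).

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for the BQL; real code would use QEMU's own lock instead. */
static pthread_mutex_t bql = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread flag: does the calling thread hold the stand-in BQL? */
static _Thread_local bool bql_held;

typedef int (*GsCallback)(void *opaque);

/*
 * Run a GS-only callback from any thread. The lock is taken only if the
 * calling thread does not already hold it, so nested calls do not deadlock.
 */
int run_under_bql(GsCallback cb, void *opaque)
{
    bool took_lock = !bql_held;
    int ret;

    if (took_lock) {
        pthread_mutex_lock(&bql);
        bql_held = true;
    }

    ret = cb(opaque);

    if (took_lock) {
        bql_held = false;
        pthread_mutex_unlock(&bql);
    }
    return ret;
}

/* Hypothetical stand-in for a permission update; counts its invocations. */
int refresh_perms_stub(void *opaque)
{
    int *calls = opaque;

    assert(bql_held);   /* GS code may rely on the lock being held */
    (*calls)++;
    return 0;
}

/* Callback that itself goes through run_under_bql() (nested use). */
int nested_stub(void *opaque)
{
    return run_under_bql(refresh_perms_stub, opaque);
}
```

An I/O path such as the luks amend code would then wrap only the permission update in run_under_bql() and keep the actual key update outside of the lock.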

As for the second case, the same applies as above, with the differences 
that we have no jobs, so this code must always run in the block device’s 
AioContext (I think), which rules out (1); but (3) would become easier 
(i.e. require the RESIZE permission all the time), although that too 
might have an impact on existing users (don’t think so, though).  In any 
case, if we could do (4), that would solve the problem here, too.


And finally, another notable thing I noticed is that 
create-related functions are handled inconsistently.  I believe they 
should all be GS functions; qmp_blockdev_create() seems to agree with me 
on this, but we currently seem to have some bugs there.  It’s possible 
to invoke blockdev-create on a block device that’s in an I/O thread, and 
then qemu crashes.  Oops.  (The comment in qmp_blockdev_create() says 
that the block drivers’ implementations should prevent this, but 
apparently they don’t...?) In any case, that’s a pre-existing bug, of 
course, that doesn’t concern this series (other than that it suggests 
that “create” functions should be classified as GS).

Hanna




* Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O
  2021-11-15 16:03 ` Hanna Reitz
@ 2021-11-15 16:11   ` Daniel P. Berrangé
  2021-11-18 13:50   ` Paolo Bonzini
  2021-11-18 14:04   ` Paolo Bonzini
  2 siblings, 0 replies; 86+ messages in thread
From: Daniel P. Berrangé @ 2021-11-15 16:11 UTC (permalink / raw)
  To: Hanna Reitz
  Cc: Emanuele Giuseppe Esposito, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy, Eduardo Habkost, qemu-block,
	Juan Quintela, qemu-devel, Eric Blake, Richard Henderson,
	Markus Armbruster, Dr. David Alan Gilbert, Stefan Hajnoczi,
	Paolo Bonzini, Fam Zheng, John Snow

On Mon, Nov 15, 2021 at 05:03:28PM +0100, Hanna Reitz wrote:
> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> > Currently, block layer APIs like block-backend.h contain a mix of
> > functions that are either running in the main loop and under the
> > BQL, or are thread-safe functions and run in iothreads performing I/O.
> > The functions running under BQL also take care of modifying the
> > block graph, by using drain and/or aio_context_acquire/release.
> > This makes it very confusing to understand where each function
> > runs, and what assumptions it provided with regards to thread
> > safety.
> > 
> > We call the functions running under BQL "global state (GS) API", and
> > distinguish them from the thread-safe "I/O API".
> > 
> > The aim of this series is to split the relevant block headers in
> > global state and I/O sub-headers.
> 
> Despite leaving quite some comments, the series and the split seem
> reasonable to me overall.  (This is a pretty big series, after all, so those
> “some comments” stack up against a majority of changes that seem OK to me.
> :))
> 
> One thing I noticed while reviewing is that it’s really hard to verify that
> no I/O function calls a GS function.  What would be wonderful is some
> function marker like coroutine_fn that marks GS functions (or I/O functions)
> and that we could then verify the call paths.  But AFAIU we’ve always wanted
> precisely that for coroutine_fn and still don’t have it, so this seems like
> extremely wishful thinking... :(

Even if we don't programmatically verify these rules, it would be a major
step forward if we simply adopted a standard naming convention for the
APIs that was essentially self-describing, e.g. block_io_XXX for all I/O
stuff, block_state_XXXX for all global state, and block_common_XXX if
we have stuff applicable to both use cases.
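
A minimal sketch of how such prefixes could read in practice follows. BlockNode and all three functions are hypothetical and exist only for this illustration; they are not proposed names for real QEMU functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a block node, for illustration only. */
typedef struct BlockNode {
    int64_t length;
    int read_only;
} BlockNode;

/* block_io_*: the prefix says this belongs to the I/O API. */
int64_t block_io_getlength(const BlockNode *node)
{
    return node->length;
}

/* block_state_*: the prefix says this belongs to the global state API. */
void block_state_set_read_only(BlockNode *node, int read_only)
{
    node->read_only = read_only;
}

/* block_common_*: the prefix says this is usable from both APIs. */
int block_common_is_read_only(const BlockNode *node)
{
    return node->read_only;
}
```

With names like these, a call to a block_state_ function from inside a block_io_ function stands out in review (or in a simple grep) without having to look up which header declares it.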

I wouldn't suggest doing it as part of this series, but I think it'd
be worthwhile in general for the medium-long term, despite the code
churn in updating all usage in the short term.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v4 05/25] block/block-backend.c: assertions for block-backend
  2021-11-12 11:01   ` Hanna Reitz
@ 2021-11-16 10:15     ` Emanuele Giuseppe Esposito
  2021-11-16 12:29       ` Hanna Reitz
  0 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-16 10:15 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake



On 12/11/2021 12:01, Hanna Reitz wrote:
> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>> All the global state (GS) API functions will check that
>> qemu_in_main_thread() returns true. If not, it means
>> that the safety of BQL cannot be guaranteed, and
>> they need to be moved to I/O.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   block/block-backend.c  | 90 +++++++++++++++++++++++++++++++++++++++++-
>>   softmmu/qdev-monitor.c |  2 +
>>   2 files changed, 91 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/block-backend.c b/block/block-backend.c
>> index 0afc03fd66..ed45576007 100644
>> --- a/block/block-backend.c
>> +++ b/block/block-backend.c
> 
> [...]
> 
>> @@ -1550,6 +1596,7 @@ BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, 
>> int64_t offset,
>>   void blk_aio_cancel(BlockAIOCB *acb)
>>   {
>> +    assert(qemu_in_main_thread());
>>       bdrv_aio_cancel(acb);
>>   }
> 
> This function is in block-backend-io.h, though.

I am confused a little on the {blk/bdrv}_aio functions, namely
blk_aio_cancel
bdrv_aio_cancel
blk_aio_cancel_async
bdrv_aio_cancel_async

Do you think they should be I/O? The assertion seems to hold though.

I agree with the rest of the comments.

Thank you,
Emanuele




* Re: [PATCH v4 04/25] include/sysemu/block-backend: split header into I/O and global state (GS) API
  2021-11-12 10:23   ` Hanna Reitz
@ 2021-11-16 10:16     ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-16 10:16 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake


>> +int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags);
>> +int64_t blk_nb_sectors(BlockBackend *blk);
> 
> I’d have considered this an I/O function, and blk_getlength() is 
> classified as such.  Why not this?

This one by itself is only invoked under BQL. In fact, I believe that in 
migration/block.c it is always wrapped by qemu_mutex_{lock/unlock}_iothread()
pairs.

On the other hand, as you said, semantically it may make more sense to 
classify it as I/O. Its bdrv_ counterpart, bdrv_nb_sectors, is also an 
I/O function, so you are right.

>> +int blk_probe_geometry(BlockBackend *blk, HDGeometry *geo);
>> +BlockAIOCB *blk_abort_aio_request(BlockBackend *blk,
>> +                                  BlockCompletionFunc *cb,
>> +                                  void *opaque, int ret);
> 
> This sounds more like an I/O function to me.
> 

Semantically it might make sense as an I/O function. Functionally, I 
don't see any iothread using it.

I agree with the rest of the comments.

Thank you,
Emanuele




* Re: [PATCH v4 06/25] include/block/block_int: split header into I/O and global state API
  2021-11-12 12:17   ` Hanna Reitz
@ 2021-11-16 10:24     ` Emanuele Giuseppe Esposito
  2021-11-16 12:30       ` Hanna Reitz
  0 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-16 10:24 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake



On 12/11/2021 13:17, Hanna Reitz wrote:
> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>> Similarly to the previous patch, split block_int.h
>> in block_int-io.h and block_int-global-state.h
>>
>> block_int-common.h contains the structures shared between
>> the two headers, and the functions that can't be categorized as
>> I/O or global state.
>>
>> Assertions are added in the next patch.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   blockdev.c                             |    5 +
>>   include/block/block_int-common.h       | 1164 +++++++++++++++++++
>>   include/block/block_int-global-state.h |  319 +++++
>>   include/block/block_int-io.h           |  163 +++
>>   include/block/block_int.h              | 1478 +-----------------------
>>   5 files changed, 1654 insertions(+), 1475 deletions(-)
>>   create mode 100644 include/block/block_int-common.h
>>   create mode 100644 include/block/block_int-global-state.h
>>   create mode 100644 include/block/block_int-io.h
> 
> [...]
> 
>> diff --git a/include/block/block_int-common.h 
>> b/include/block/block_int-common.h
>> new file mode 100644
>> index 0000000000..79a3d801d2
>> --- /dev/null
>> +++ b/include/block/block_int-common.h
> 
> [...]
> 
>> +struct BlockDriver {
> 
> [...]
> 
>> +    /**
>> +     * Try to get @bs's logical and physical block size.
>> +     * On success, store them in @bsz and return zero.
>> +     * On failure, return negative errno.
>> +     */
>> +    /* I/O API, even though if it's a filter jumps on parent */
> 
> I don’t understand this...
> 
>> +    int (*bdrv_probe_blocksizes)(BlockDriverState *bs, BlockSizes *bsz);
>> +    /**
>> +     * Try to get @bs's geometry (cyls, heads, sectors)
>> +     * On success, store them in @geo and return 0.
>> +     * On failure return -errno.
>> +     * Only drivers that want to override guest geometry implement this
>> +     * callback; see hd_geometry_guess().
>> +     */
>> +    /* I/O API, even though if it's a filter jumps on parent */
> 
> ...or this comment.  bdrv_probe_blocksizes() and bdrv_probe_geometry() 
> are in block-global-state.h, so why are these methods part of the I/O 
> API?  (And I’m afraid I can’t parse “even though if it’s a filter jumps 
> on parent”.)
>

Ok this is weird. This comment should not have been there, please ignore 
it. It was just a note for myself while I was doing one of the many 
passes to split all these functions. This is not I/O and, as you have 
probably already seen, I did not put it in I/O. Also, patch 19 takes care 
of the function pointers in BlockDriver, not this one (but you discovered 
that already).

Apologies.

Emanuele

> Hanna
> 
>> +    int (*bdrv_probe_geometry)(BlockDriverState *bs, HDGeometry *geo);
> 




* Re: [PATCH v4 05/25] block/block-backend.c: assertions for block-backend
  2021-11-16 10:15     ` Emanuele Giuseppe Esposito
@ 2021-11-16 12:29       ` Hanna Reitz
  0 siblings, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-16 12:29 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 16.11.21 11:15, Emanuele Giuseppe Esposito wrote:
>
>
> On 12/11/2021 12:01, Hanna Reitz wrote:
>> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>>> All the global state (GS) API functions will check that
>>> qemu_in_main_thread() returns true. If not, it means
>>> that the safety of BQL cannot be guaranteed, and
>>> they need to be moved to I/O.
>>>
>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>   block/block-backend.c  | 90 
>>> +++++++++++++++++++++++++++++++++++++++++-
>>>   softmmu/qdev-monitor.c |  2 +
>>>   2 files changed, 91 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/block-backend.c b/block/block-backend.c
>>> index 0afc03fd66..ed45576007 100644
>>> --- a/block/block-backend.c
>>> +++ b/block/block-backend.c
>>
>> [...]
>>
>>> @@ -1550,6 +1596,7 @@ BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, 
>>> int64_t offset,
>>>   void blk_aio_cancel(BlockAIOCB *acb)
>>>   {
>>> +    assert(qemu_in_main_thread());
>>>       bdrv_aio_cancel(acb);
>>>   }
>>
>> This function is in block-backend-io.h, though.
>
> I am confused a little on the {blk/bdrv}_aio functions, namely
> blk_aio_cancel
> bdrv_aio_cancel
> blk_aio_cancel_async
> bdrv_aio_cancel_async
>
> Do you think they should be I/O? The assertion seems to hold though.

Hm, semantically I would have classified them as I/O because they don’t 
modify state.  I don’t have a strong opinion, though, because they don’t 
actually do I/O.  They just cancel other I/O requests.

Most importantly though now I see there’s a comment in bdrv_aio_cancel() 
that states that “thread-safe code should use bdrv_aio_cancel_async 
exclusively”, which to me implies that bdrv_aio_cancel() (and 
blk_aio_cancel()) must be classified as GS, and it sounds like 
bdrv_aio_cancel_async() (and blk_aio_cancel_async()) should be 
classified as I/O.  Looking at the AIOCBInfo.cancel_async 
implementations (called by bdrv_aio_cancel_async()) I’m not sure they’re 
all really thread-safe, though...?  But at least bdrv_aio_cancel() 
claims they should be, so...

It seems to me like the intended separation is that bdrv_aio_cancel() 
should be GS and bdrv_aio_cancel_async() should be I/O.  I can’t verify 
that the .cancel_async implementations are really thread-safe, but 
neither can I verify that blk_aio_cancel_async() is only called by BQL 
callers.  That the assertions hold during testing isn’t too convincing 
for me, because we never wrote tests specifically to exercise these paths.
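
To illustrate why the two variants might end up in different categories, here is a minimal sketch. It assumes that the synchronous variant is implemented as "request cancellation, then poll until the request has completed"; Request, poll_once(), request_cancel_async() and request_cancel() are hypothetical and are not the QEMU implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy in-flight request, for illustration only. */
typedef struct Request {
    bool cancel_requested;
    bool completed;
    int steps_left;   /* event-loop iterations until normal completion */
} Request;

/* One iteration of a toy event loop: makes progress on the request. */
void poll_once(Request *req)
{
    if (req->completed) {
        return;
    }
    if (req->cancel_requested || --req->steps_left <= 0) {
        req->completed = true;
    }
}

/* Asynchronous cancel: only flags the request and never waits. */
void request_cancel_async(Request *req)
{
    req->cancel_requested = true;
}

/*
 * Synchronous cancel: flags the request, then drives the event loop until
 * the request has completed. Because it has to poll, it is tied to a
 * context that is allowed to run that loop.
 */
void request_cancel(Request *req)
{
    request_cancel_async(req);
    while (!req->completed) {
        poll_once(req);
    }
}
```

In this toy model the asynchronous variant returns with the request still in flight, while the synchronous variant only returns once it has completed, which is the part that needs a polling context.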

Hanna




* Re: [PATCH v4 06/25] include/block/block_int: split header into I/O and global state API
  2021-11-16 10:24     ` Emanuele Giuseppe Esposito
@ 2021-11-16 12:30       ` Hanna Reitz
  0 siblings, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-16 12:30 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 16.11.21 11:24, Emanuele Giuseppe Esposito wrote:
>
>
> On 12/11/2021 13:17, Hanna Reitz wrote:
>> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>>> Similarly to the previous patch, split block_int.h
>>> in block_int-io.h and block_int-global-state.h
>>>
>>> block_int-common.h contains the structures shared between
>>> the two headers, and the functions that can't be categorized as
>>> I/O or global state.
>>>
>>> Assertions are added in the next patch.
>>>
>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>   blockdev.c                             |    5 +
>>>   include/block/block_int-common.h       | 1164 +++++++++++++++++++
>>>   include/block/block_int-global-state.h |  319 +++++
>>>   include/block/block_int-io.h           |  163 +++
>>>   include/block/block_int.h              | 1478 
>>> +-----------------------
>>>   5 files changed, 1654 insertions(+), 1475 deletions(-)
>>>   create mode 100644 include/block/block_int-common.h
>>>   create mode 100644 include/block/block_int-global-state.h
>>>   create mode 100644 include/block/block_int-io.h
>>
>> [...]
>>
>>> diff --git a/include/block/block_int-common.h 
>>> b/include/block/block_int-common.h
>>> new file mode 100644
>>> index 0000000000..79a3d801d2
>>> --- /dev/null
>>> +++ b/include/block/block_int-common.h
>>
>> [...]
>>
>>> +struct BlockDriver {
>>
>> [...]
>>
>>> +    /**
>>> +     * Try to get @bs's logical and physical block size.
>>> +     * On success, store them in @bsz and return zero.
>>> +     * On failure, return negative errno.
>>> +     */
>>> +    /* I/O API, even though if it's a filter jumps on parent */
>>
>> I don’t understand this...
>>
>>> +    int (*bdrv_probe_blocksizes)(BlockDriverState *bs, BlockSizes 
>>> *bsz);
>>> +    /**
>>> +     * Try to get @bs's geometry (cyls, heads, sectors)
>>> +     * On success, store them in @geo and return 0.
>>> +     * On failure return -errno.
>>> +     * Only drivers that want to override guest geometry implement 
>>> this
>>> +     * callback; see hd_geometry_guess().
>>> +     */
>>> +    /* I/O API, even though if it's a filter jumps on parent */
>>
>> ...or this comment.  bdrv_probe_blocksizes() and 
>> bdrv_probe_geometry() are in block-global-state.h, so why are these 
>> methods part of the I/O API?  (And I’m afraid I can’t parse “even 
>> though if it’s a filter jumps on parent”.)
>>
>
> Ok this is weird. This comment should not have been there, please 
> ignore it. It was just a note for myself while I was doing one the the 
> many pass to split all these functions. This is not I/O and as you 
> probably have already seen, I did not put in I/O. Also patch 19 takes 
> care of the function pointers in BlockDriver, not this (but you 
> discovered it already).
>
> Apologies.

Not a problem, things like this happen.  Just makes you wonder as a 
reviewer. :)

Hanna




* Re: [PATCH v4 02/25] include/block/block: split header into I/O and global state API
  2021-11-12 12:25   ` Hanna Reitz
@ 2021-11-16 14:00     ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-16 14:00 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake



On 12/11/2021 13:25, Hanna Reitz wrote:
> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>> block.h currently contains a mix of functions:
>> some of them run under the BQL and modify the block layer graph,
>> others are instead thread-safe and perform I/O in iothreads.
>> It is not easy to understand which function is part of which
>> group (I/O vs GS), and this patch aims to clarify it.
>>
>> The "GS" functions need the BQL, and often use
>> aio_context_acquire/release and/or drain to be sure they
>> can modify the graph safely.
>> The I/O function are instead thread safe, and can run in
>> any AioContext.
>>
>> By splitting the header in two files, block-io.h
>> and block-global-state.h we have a clearer view on what
>> needs what kind of protection. block-common.h
>> contains common structures shared by both headers.
>>
>> block.h is left there for legacy and to avoid changing
>> all includes in all c files that use the block APIs.
>>
>> Assertions are added in the next patch.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   block.c                            |   3 +
>>   block/meson.build                  |   7 +-
>>   include/block/block-common.h       | 389 +++++++++++++
>>   include/block/block-global-state.h | 286 ++++++++++
>>   include/block/block-io.h           | 306 ++++++++++
>>   include/block/block.h              | 878 +----------------------------
>>   6 files changed, 1012 insertions(+), 857 deletions(-)
>>   create mode 100644 include/block/block-common.h
>>   create mode 100644 include/block/block-global-state.h
>>   create mode 100644 include/block/block-io.h
> 
> [...]
> 
>> diff --git a/include/block/block-common.h b/include/block/block-common.h
>> new file mode 100644
>> index 0000000000..4f1fd8de21
>> --- /dev/null
>> +++ b/include/block/block-common.h
> 
> [...]
> 
>> +#define BLKDBG_EVENT(child, evt) \
>> +    do { \
>> +        if (child) { \
>> +            bdrv_debug_event(child->bs, evt); \
>> +        } \
>> +    } while (0)
> 
> This is defined twice, once here, and...

This is unnecessary, as bdrv_debug_event is in block-io.h.
I will remove the copy in block-common.h.

> 
>> diff --git a/include/block/block-io.h b/include/block/block-io.h
>> new file mode 100644
>> index 0000000000..9af4609ccb
>> --- /dev/null
>> +++ b/include/block/block-io.h
> 
> [...]
> 
>> +#define BLKDBG_EVENT(child, evt) \
>> +    do { \
>> +        if (child) { \
>> +            bdrv_debug_event(child->bs, evt); \
>> +        } \
>> +    } while (0)
> 
> ...once here.
> 
> [...]
> 
>> +/**
>> + * bdrv_drained_begin:
>> + *
>> + * Begin a quiesced section for exclusive access to the BDS, by 
>> disabling
>> + * external request sources including NBD server and device model. 
>> Note that
>> + * this doesn't block timers or coroutines from submitting more 
>> requests, which
>> + * means block_job_pause is still necessary.
> 
> Where does this sentence come from?  I can’t see it in master or in the 
> lines removed from block.h:
> 
This is another mistake, I copied the old header. This sentence was 
removed by this patch (sent by me) and then merged in master:
https://patchew.org/QEMU/20210903113800.59970-1-eesposit@redhat.com/

Will fix this and double check all headers so that the comments are 
updated (but there shouldn't be any other mistakes).

Thank you,
Emanuele




* Re: [PATCH v4 04/25] include/sysemu/block-backend: split header into I/O and global state (GS) API
  2021-11-12 12:30   ` Hanna Reitz
@ 2021-11-16 14:24     ` Emanuele Giuseppe Esposito
  2021-11-16 15:07       ` Hanna Reitz
  0 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-16 14:24 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake



On 12/11/2021 13:30, Hanna Reitz wrote:
> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>> Similarly to the previous patches, split block-backend.h
>> in block-backend-io.h and block-backend-global-state.h
>>
>> In addition, remove "block/block.h" include as it seems
>> it is not necessary anymore, together with "qemu/iov.h"
>>
>> block-backend-common.h contains the structures shared between
>> the two headers, and the functions that can't be categorized as
>> I/O or global state.
>>
>> Assertions are added in the next patch.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   block/block-backend.c                       |   9 +-
>>   include/sysemu/block-backend-common.h       |  74 ++++++
>>   include/sysemu/block-backend-global-state.h | 122 +++++++++
>>   include/sysemu/block-backend-io.h           | 139 ++++++++++
>>   include/sysemu/block-backend.h              | 269 +-------------------
>>   5 files changed, 344 insertions(+), 269 deletions(-)
>>   create mode 100644 include/sysemu/block-backend-common.h
>>   create mode 100644 include/sysemu/block-backend-global-state.h
>>   create mode 100644 include/sysemu/block-backend-io.h
> 
> [...]
> 
>> diff --git a/include/sysemu/block-backend.h 
>> b/include/sysemu/block-backend.h
>> index e5e1524f06..038be9fc40 100644
>> --- a/include/sysemu/block-backend.h
>> +++ b/include/sysemu/block-backend.h
>> @@ -13,272 +13,9 @@
>>   #ifndef BLOCK_BACKEND_H
>>   #define BLOCK_BACKEND_H
>> -#include "qemu/iov.h"
>> -#include "block/throttle-groups.h"
>> +#include "block-backend-global-state.h"
>> +#include "block-backend-io.h"
>> -/*
>> - * TODO Have to include block/block.h for a bunch of block layer
>> - * types.  Unfortunately, this pulls in the whole BlockDriverState
>> - * API, which we don't want used by many BlockBackend users.  Some of
>> - * the types belong here, and the rest should be split into a common
>> - * header and one for the BlockDriverState API.
>> - */
>> -#include "block/block.h"
> 
> This note and the include is gone.  Sounds like something positive, but 
> why is this possible?
> 

Basically block/throttle-groups.h includes block/block_int.h that 
internally includes block/block.h.

But I am not sure if you actually want to keep this comment as reminder 
for future work. Should I keep it?

Thank you,
Emanuele




* Re: [PATCH v4 04/25] include/sysemu/block-backend: split header into I/O and global state (GS) API
  2021-11-16 14:24     ` Emanuele Giuseppe Esposito
@ 2021-11-16 15:07       ` Hanna Reitz
  0 siblings, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-16 15:07 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 16.11.21 15:24, Emanuele Giuseppe Esposito wrote:
>
>
> On 12/11/2021 13:30, Hanna Reitz wrote:
>> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>>> Similarly to the previous patches, split block-backend.h
>>> in block-backend-io.h and block-backend-global-state.h
>>>
>>> In addition, remove "block/block.h" include as it seems
>>> it is not necessary anymore, together with "qemu/iov.h"
>>>
>>> block-backend-common.h contains the structures shared between
>>> the two headers, and the functions that can't be categorized as
>>> I/O or global state.
>>>
>>> Assertions are added in the next patch.
>>>
>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>   block/block-backend.c                       |   9 +-
>>>   include/sysemu/block-backend-common.h       |  74 ++++++
>>>   include/sysemu/block-backend-global-state.h | 122 +++++++++
>>>   include/sysemu/block-backend-io.h           | 139 ++++++++++
>>>   include/sysemu/block-backend.h              | 269 
>>> +-------------------
>>>   5 files changed, 344 insertions(+), 269 deletions(-)
>>>   create mode 100644 include/sysemu/block-backend-common.h
>>>   create mode 100644 include/sysemu/block-backend-global-state.h
>>>   create mode 100644 include/sysemu/block-backend-io.h
>>
>> [...]
>>
>>> diff --git a/include/sysemu/block-backend.h 
>>> b/include/sysemu/block-backend.h
>>> index e5e1524f06..038be9fc40 100644
>>> --- a/include/sysemu/block-backend.h
>>> +++ b/include/sysemu/block-backend.h
>>> @@ -13,272 +13,9 @@
>>>   #ifndef BLOCK_BACKEND_H
>>>   #define BLOCK_BACKEND_H
>>> -#include "qemu/iov.h"
>>> -#include "block/throttle-groups.h"
>>> +#include "block-backend-global-state.h"
>>> +#include "block-backend-io.h"
>>> -/*
>>> - * TODO Have to include block/block.h for a bunch of block layer
>>> - * types.  Unfortunately, this pulls in the whole BlockDriverState
>>> - * API, which we don't want used by many BlockBackend users. Some of
>>> - * the types belong here, and the rest should be split into a common
>>> - * header and one for the BlockDriverState API.
>>> - */
>>> -#include "block/block.h"
>>
>> This note and the include is gone.  Sounds like something positive, 
>> but why is this possible?
>>
>
> Basically block/throttle-groups.h includes block/block_int.h that 
> internally includes block/block.h.
>
> But I am not sure if you actually want to keep this comment as 
> reminder for future work. Should I keep it?

Good question.  I think I’d keep it and the block.h include; I mean, the 
throttle-groups.h include was there before already, so perhaps this was 
indeed only intended as a reminder.

The other reason to keep it is that ideally this is just a refactoring 
patch, so I wouldn’t touch anything that needn’t be touched.

Hanna




* Re: [PATCH v4 07/25] assertions for block_int global state API
  2021-11-12 13:51   ` Hanna Reitz
@ 2021-11-16 15:43     ` Emanuele Giuseppe Esposito
  2021-11-16 16:46       ` Hanna Reitz
  0 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-16 15:43 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake



On 12/11/2021 14:51, Hanna Reitz wrote:
> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   block.c                         | 17 +++++++++++++++++
>>   block/backup.c                  |  1 +
>>   block/block-backend.c           |  3 +++
>>   block/commit.c                  |  2 ++
>>   block/dirty-bitmap.c            |  1 +
>>   block/io.c                      |  6 ++++++
>>   block/mirror.c                  |  4 ++++
>>   block/monitor/bitmap-qmp-cmds.c |  6 ++++++
>>   block/stream.c                  |  2 ++
>>   blockdev.c                      |  7 +++++++
>>   10 files changed, 49 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index 672f946065..41c5883c5c 100644
>> --- a/block.c
>> +++ b/block.c
> 
> [...]
> 
>> @@ -7473,6 +7488,7 @@ static bool append_strong_runtime_options(QDict 
>> *d, BlockDriverState *bs)
>>    * would result in exactly bs->backing. */
>>   bool bdrv_backing_overridden(BlockDriverState *bs)
>>   {
>> +    assert(qemu_in_main_thread());
>>       if (bs->backing) {
>>           return strcmp(bs->auto_backing_file,
>>                         bs->backing->bs->filename);
> 
> This function is in block_int-common.h, though.

It can be classified as GS, since it runs under the BQL.

(Actually, it is only used in block.c, so if you want I can make it 
static.) Otherwise, I will just move it to GS.

I agree with the rest.

Thank you,
Emanuele

> 
> [...]
> 
>> diff --git a/block/io.c b/block/io.c
>> index c5d7f8495e..f271ab3684 100644
>> --- a/block/io.c
>> +++ b/block/io.c
> 
> [...]
> 
>> @@ -3419,6 +3423,7 @@ int coroutine_fn 
>> bdrv_co_copy_range_from(BdrvChild *src, int64_t src_offset,
>>   {
>>       trace_bdrv_co_copy_range_from(src, src_offset, dst, dst_offset, 
>> bytes,
>>                                     read_flags, write_flags);
>> +    assert(qemu_in_main_thread());
>>       return bdrv_co_copy_range_internal(src, src_offset, dst, 
>> dst_offset,
>>                                          bytes, read_flags, 
>> write_flags, true);
>>   }
> 
> This is a block_int-io.h function.
> 
>> @@ -3435,6 +3440,7 @@ int coroutine_fn bdrv_co_copy_range_to(BdrvChild 
>> *src, int64_t src_offset,
>>   {
>>       trace_bdrv_co_copy_range_to(src, src_offset, dst, dst_offset, 
>> bytes,
>>                                   read_flags, write_flags);
>> +    assert(qemu_in_main_thread());
>>       return bdrv_co_copy_range_internal(src, src_offset, dst, 
>> dst_offset,
>>                                          bytes, read_flags, 
>> write_flags, false);
>>   }
> 
> This, too.
> 
> Hanna
> 




* Re: [PATCH v4 07/25] assertions for block_int global state API
  2021-11-16 15:43     ` Emanuele Giuseppe Esposito
@ 2021-11-16 16:46       ` Hanna Reitz
  0 siblings, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-16 16:46 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 16.11.21 16:43, Emanuele Giuseppe Esposito wrote:
>
>
> On 12/11/2021 14:51, Hanna Reitz wrote:
>> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>   block.c                         | 17 +++++++++++++++++
>>>   block/backup.c                  |  1 +
>>>   block/block-backend.c           |  3 +++
>>>   block/commit.c                  |  2 ++
>>>   block/dirty-bitmap.c            |  1 +
>>>   block/io.c                      |  6 ++++++
>>>   block/mirror.c                  |  4 ++++
>>>   block/monitor/bitmap-qmp-cmds.c |  6 ++++++
>>>   block/stream.c                  |  2 ++
>>>   blockdev.c                      |  7 +++++++
>>>   10 files changed, 49 insertions(+)
>>>
>>> diff --git a/block.c b/block.c
>>> index 672f946065..41c5883c5c 100644
>>> --- a/block.c
>>> +++ b/block.c
>>
>> [...]
>>
>>> @@ -7473,6 +7488,7 @@ static bool 
>>> append_strong_runtime_options(QDict *d, BlockDriverState *bs)
>>>    * would result in exactly bs->backing. */
>>>   bool bdrv_backing_overridden(BlockDriverState *bs)
>>>   {
>>> +    assert(qemu_in_main_thread());
>>>       if (bs->backing) {
>>>           return strcmp(bs->auto_backing_file,
>>>                         bs->backing->bs->filename);
>>
>> This function is in block_int-common.h, though.
>
> Can go as GS, since it is under BQL.
>
> (Actually, it is only used in block.c, so if you want I can put it as 
> static). Otherwise, I will just move it to GS.

Sounds good to me.

Hanna




* Re: [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers
  2021-11-15 12:48   ` Hanna Reitz
  2021-11-15 14:15     ` Hanna Reitz
@ 2021-11-17 11:33     ` Emanuele Giuseppe Esposito
  2021-11-17 12:51       ` Hanna Reitz
  1 sibling, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-17 11:33 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake



On 15/11/2021 13:48, Hanna Reitz wrote:
> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   block.c | 17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index 94bff5c757..40c4729b8d 100644
>> --- a/block.c
>> +++ b/block.c
> 
> [...]
> 
>> @@ -2148,6 +2152,7 @@ static void bdrv_child_perm(BlockDriverState 
>> *bs, BlockDriverState *child_bs,
>>                               uint64_t *nperm, uint64_t *nshared)
>>   {
>>       assert(bs->drv && bs->drv->bdrv_child_perm);
>> +    assert(qemu_in_main_thread());
>>       bs->drv->bdrv_child_perm(bs, c, role, reopen_queue,
>>                                parent_perm, parent_shared,
>>                                nperm, nshared);
> 
> (Should’ve noticed earlier, but only did now...)
> 
> First, this function is indirectly called by bdrv_refresh_perms(). I 
> understand that all perm-related functions are classified as GS.
> 
> However, bdrv_co_invalidate_cache() invokes bdrv_refresh_perms. Being 
> declared in block/coroutine.h, it’s an I/O function, so it mustn’t call 
> such a GS function. BlockDriver.bdrv_co_invalidate_cache(), 
> bdrv_invalidate_cache(), and blk_invalidate_cache() are also classified 
> as I/O functions. Perhaps all of these functions should be classified as 
> GS functions?  I believe their callers and their purpose would allow for 
> this.

I think that the *_invalidate_cache functions are I/O.
First of all, test-block-iothread.c calls bdrv_invalidate_cache in 
test_sync_op_invalidate_cache, which is deliberately run in an 
iothread. That suggests we want it to be I/O.
(Small mistake I just noticed: blk_invalidate_cache has the BQL 
assertion even though it is rightly placed in block-backend-io.h.)

> 
> Second, it’s called by bdrv_child_refresh_perms(), which is called by 
> block_crypto_amend_options_generic_luks().  This function is called by 
> block_crypto_co_amend_luks(), which is a BlockDriver.bdrv_co_amend 
> implementation, which is classified as an I/O function.
> 
> Honestly, I don’t know how to fix that mess.  The best would be if we 
> could make the perm functions thread-safe and classify them as I/O, but 
> it seems to me like that’s impossible (I sure hope I’m wrong).  On the 
> other hand, .bdrv_co_amend very much strikes me like a GS function, but 
> it isn’t.  I’m afraid it must work on nodes that are not in the main 
> context, and it launches a job, so AFAIU we absolutely cannot run it 
> under the BQL.
> 
> It almost seems to me like we’d need a thread-safe variant of the perm 
> functions that’s allowed to fail when it cannot guarantee thread safety 
> or something.  Or perhaps I’m wrong and the perm functions can actually 
> be classified as thread-safe and I/O, that’d be great…

I think that since we are currently only splitting the API and not 
addressing actual I/O thread safety, we can move the _perm functions to 
I/O and add a TODO to double-check their thread safety.

I mean, if they are not thread-safe after the split, then they are not 
thread-safe right now either.

Emanuele




* Re: [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers
  2021-11-17 11:33     ` Emanuele Giuseppe Esposito
@ 2021-11-17 12:51       ` Hanna Reitz
  2021-11-17 13:09         ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-17 12:51 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 17.11.21 12:33, Emanuele Giuseppe Esposito wrote:
>
>
> On 15/11/2021 13:48, Hanna Reitz wrote:
>> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>   block.c | 17 +++++++++++++++++
>>>   1 file changed, 17 insertions(+)
>>>
>>> diff --git a/block.c b/block.c
>>> index 94bff5c757..40c4729b8d 100644
>>> --- a/block.c
>>> +++ b/block.c
>>
>> [...]
>>
>>> @@ -2148,6 +2152,7 @@ static void bdrv_child_perm(BlockDriverState 
>>> *bs, BlockDriverState *child_bs,
>>>                               uint64_t *nperm, uint64_t *nshared)
>>>   {
>>>       assert(bs->drv && bs->drv->bdrv_child_perm);
>>> +    assert(qemu_in_main_thread());
>>>       bs->drv->bdrv_child_perm(bs, c, role, reopen_queue,
>>>                                parent_perm, parent_shared,
>>>                                nperm, nshared);
>>
>> (Should’ve noticed earlier, but only did now...)
>>
>> First, this function is indirectly called by bdrv_refresh_perms(). I 
>> understand that all perm-related functions are classified as GS.
>>
>> However, bdrv_co_invalidate_cache() invokes bdrv_refresh_perms. Being 
>> declared in block/coroutine.h, it’s an I/O function, so it mustn’t 
>> call such a GS function. BlockDriver.bdrv_co_invalidate_cache(), 
>> bdrv_invalidate_cache(), and blk_invalidate_cache() are also 
>> classified as I/O functions. Perhaps all of these functions should be 
>> classified as GS functions?  I believe their callers and their 
>> purpose would allow for this.
>
> I think that the *_invalidate_cache functions are I/O.
> First of all, test-block-iothread.c calls bdrv_invalidate_cache in 
> test_sync_op_invalidate_cache, which is purposefully called in an 
> iothread. So that hints that we want it as I/O.

Hm, OK, but bdrv_co_invalidate_cache() calls bdrv_refresh_perms(), which 
is a GS function, so that shouldn’t work, right?

> (Small mistake I just noticed: blk_invalidate_cache has the BQL 
> assertion even though it is rightly put in block-backend-io.h
>
>>
>> Second, it’s called by bdrv_child_refresh_perms(), which is called by 
>> block_crypto_amend_options_generic_luks().  This function is called 
>> by block_crypto_co_amend_luks(), which is a BlockDriver.bdrv_co_amend 
>> implementation, which is classified as an I/O function.
>>
>> Honestly, I don’t know how to fix that mess.  The best would be if we 
>> could make the perm functions thread-safe and classify them as I/O, 
>> but it seems to me like that’s impossible (I sure hope I’m wrong).  
>> On the other hand, .bdrv_co_amend very much strikes me like a GS 
>> function, but it isn’t.  I’m afraid it must work on nodes that are 
>> not in the main context, and it launches a job, so AFAIU we 
>> absolutely cannot run it under the BQL.
>>
>> It almost seems to me like we’d need a thread-safe variant of the 
>> perm functions that’s allowed to fail when it cannot guarantee thread 
>> safety or something.  Or perhaps I’m wrong and the perm functions can 
>> actually be classified as thread-safe and I/O, that’d be great…
>
> I think that since we are currently only splitting and not taking care 
> of the actual I/O thread safety, we can move the _perm functions in 
> I/O, and add a nice TODO to double check their thread safety.

:/

I would really, really like to avoid that unless it’s clear that we can 
make them thread-safe, or that there’s a way to take the BQL in I/O 
functions to call GS functions.  But the latter still wouldn’t make the 
perm functions I/O functions.  At most, I’d sort them under common 
functions.

> I mean, if they are not thread-safe after the split it means they are 
> not thread safe also right now.

Yes, sorry I wasn’t clear, I think there’s a pre-existing problem that 
your series only unveils.  I don’t know whether it has implications in 
practice yet.

Hanna




* Re: [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers
  2021-11-17 12:51       ` Hanna Reitz
@ 2021-11-17 13:09         ` Emanuele Giuseppe Esposito
  2021-11-17 13:34           ` Hanna Reitz
  0 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-17 13:09 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake



On 17/11/2021 13:51, Hanna Reitz wrote:
> On 17.11.21 12:33, Emanuele Giuseppe Esposito wrote:
>>
>>
>> On 15/11/2021 13:48, Hanna Reitz wrote:
>>> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>> ---
>>>>   block.c | 17 +++++++++++++++++
>>>>   1 file changed, 17 insertions(+)
>>>>
>>>> diff --git a/block.c b/block.c
>>>> index 94bff5c757..40c4729b8d 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>
>>> [...]
>>>
>>>> @@ -2148,6 +2152,7 @@ static void bdrv_child_perm(BlockDriverState 
>>>> *bs, BlockDriverState *child_bs,
>>>>                               uint64_t *nperm, uint64_t *nshared)
>>>>   {
>>>>       assert(bs->drv && bs->drv->bdrv_child_perm);
>>>> +    assert(qemu_in_main_thread());
>>>>       bs->drv->bdrv_child_perm(bs, c, role, reopen_queue,
>>>>                                parent_perm, parent_shared,
>>>>                                nperm, nshared);
>>>
>>> (Should’ve noticed earlier, but only did now...)
>>>
>>> First, this function is indirectly called by bdrv_refresh_perms(). I 
>>> understand that all perm-related functions are classified as GS.
>>>
>>> However, bdrv_co_invalidate_cache() invokes bdrv_refresh_perms. Being 
>>> declared in block/coroutine.h, it’s an I/O function, so it mustn’t 
>>> call such a GS function. BlockDriver.bdrv_co_invalidate_cache(), 
>>> bdrv_invalidate_cache(), and blk_invalidate_cache() are also 
>>> classified as I/O functions. Perhaps all of these functions should be 
>>> classified as GS functions?  I believe their callers and their 
>>> purpose would allow for this.
>>
>> I think that the *_invalidate_cache functions are I/O.
>> First of all, test-block-iothread.c calls bdrv_invalidate_cache in 
>> test_sync_op_invalidate_cache, which is purposefully called in an 
>> iothread. So that hints that we want it as I/O.
> 
> Hm, OK, but bdrv_co_invalidate_cache() calls bdrv_refresh_perms(), which 
> is a GS function, so that shouldn’t work, right?

OK, let's take a step back for a moment: can you tell me why the perm 
functions should be GS?

On one hand, I see they are also used by I/O, as shown above. On the 
other hand, I can see that permissions should only be modified under 
the BQL, but I don't have a solid argument to support that.
So I wonder whether you have a more specific reason to classify them 
as GS.

Clarifying this may help find a clean solution to this problem.

> 
>> (Small mistake I just noticed: blk_invalidate_cache has the BQL 
>> assertion even though it is rightly put in block-backend-io.h
>>
>>>
>>> Second, it’s called by bdrv_child_refresh_perms(), which is called by 
>>> block_crypto_amend_options_generic_luks().  This function is called 
>>> by block_crypto_co_amend_luks(), which is a BlockDriver.bdrv_co_amend 
>>> implementation, which is classified as an I/O function.
>>>
>>> Honestly, I don’t know how to fix that mess.  The best would be if we 
>>> could make the perm functions thread-safe and classify them as I/O, 
>>> but it seems to me like that’s impossible (I sure hope I’m wrong). On 
>>> the other hand, .bdrv_co_amend very much strikes me like a GS 
>>> function, but it isn’t.  I’m afraid it must work on nodes that are 
>>> not in the main context, and it launches a job, so AFAIU we 
>>> absolutely cannot run it under the BQL.
>>>
>>> It almost seems to me like we’d need a thread-safe variant of the 
>>> perm functions that’s allowed to fail when it cannot guarantee thread 
>>> safety or something.  Or perhaps I’m wrong and the perm functions can 
>>> actually be classified as thread-safe and I/O, that’d be great…
>>
>> I think that since we are currently only splitting and not taking care 
>> of the actual I/O thread safety, we can move the _perm functions in 
>> I/O, and add a nice TODO to double check their thread safety.
> 
> :/
> 
> I would really, really like to avoid that unless it’s clear that we can 
> make them thread-safe, or that there’s a way to take the BQL in I/O 
> functions to call GS functions.  But the latter still wouldn’t make the 
> perm functions I/O functions.  At most, I’d sort them under common 
> functions.
> 
>> I mean, if they are not thread-safe after the split it means they are 
>> not thread safe also right now.
> 
> Yes, sorry I wasn’t clear, I think there’s a pre-existing problem that 
> your series only unveils.  I don’t know whether it has implications in 
> practice yet.
> 
> Hanna
> 




* Re: [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers
  2021-11-17 13:09         ` Emanuele Giuseppe Esposito
@ 2021-11-17 13:34           ` Hanna Reitz
  0 siblings, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-17 13:34 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 17.11.21 14:09, Emanuele Giuseppe Esposito wrote:
>
>
> On 17/11/2021 13:51, Hanna Reitz wrote:
>> On 17.11.21 12:33, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> On 15/11/2021 13:48, Hanna Reitz wrote:
>>>> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>>>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>> ---
>>>>>   block.c | 17 +++++++++++++++++
>>>>>   1 file changed, 17 insertions(+)
>>>>>
>>>>> diff --git a/block.c b/block.c
>>>>> index 94bff5c757..40c4729b8d 100644
>>>>> --- a/block.c
>>>>> +++ b/block.c
>>>>
>>>> [...]
>>>>
>>>>> @@ -2148,6 +2152,7 @@ static void bdrv_child_perm(BlockDriverState 
>>>>> *bs, BlockDriverState *child_bs,
>>>>>                               uint64_t *nperm, uint64_t *nshared)
>>>>>   {
>>>>>       assert(bs->drv && bs->drv->bdrv_child_perm);
>>>>> +    assert(qemu_in_main_thread());
>>>>>       bs->drv->bdrv_child_perm(bs, c, role, reopen_queue,
>>>>>                                parent_perm, parent_shared,
>>>>>                                nperm, nshared);
>>>>
>>>> (Should’ve noticed earlier, but only did now...)
>>>>
>>>> First, this function is indirectly called by bdrv_refresh_perms(). 
>>>> I understand that all perm-related functions are classified as GS.
>>>>
>>>> However, bdrv_co_invalidate_cache() invokes bdrv_refresh_perms. 
>>>> Being declared in block/coroutine.h, it’s an I/O function, so it 
>>>> mustn’t call such a GS function. 
>>>> BlockDriver.bdrv_co_invalidate_cache(), bdrv_invalidate_cache(), 
>>>> and blk_invalidate_cache() are also classified as I/O functions. 
>>>> Perhaps all of these functions should be classified as GS 
>>>> functions?  I believe their callers and their purpose would allow 
>>>> for this.
>>>
>>> I think that the *_invalidate_cache functions are I/O.
>>> First of all, test-block-iothread.c calls bdrv_invalidate_cache in 
>>> test_sync_op_invalidate_cache, which is purposefully called in an 
>>> iothread. So that hints that we want it as I/O.
>>
>> Hm, OK, but bdrv_co_invalidate_cache() calls bdrv_refresh_perms(), 
>> which is a GS function, so that shouldn’t work, right?
>
> Ok let's take a step back for one moment: can you tell me why the perm 
> functions should be GS?
>
> On one side I see they are also used by I/O, as we can see above. On 
> the other side, I kinda see that permission should only be modified 
> under BQL. But I don't have any valid point to sustain that.
> So I wonder if you have any specific and more valid reason to put them 
> as GS.

First, I believe permissions to be part of the block graph state, and so 
global state.  But, well, that could be dismissed as just a hunch.

Second, permissions have transaction mechanisms: you try to update them 
on every node; if one fails, all are aborted, else all are committed.
So this is by no means an atomic operation but quite drawn out.
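
To make the "drawn out, not atomic" point concrete, here is a minimal
sketch of a prepare/abort/commit update over several nodes. It is not
QEMU code: Node, node_prepare() and update_perms() are hypothetical
names invented for illustration, and the real permission code is
considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a graph node; not QEMU's BlockDriverState. */
typedef struct Node {
    uint64_t perm;      /* committed permissions */
    uint64_t new_perm;  /* tentative permissions staged by a transaction */
    uint64_t allowed;   /* permissions this node is able to grant */
} Node;

/* Prepare step: stage the request, or fail if the node cannot grant it. */
static bool node_prepare(Node *n, uint64_t perm)
{
    if (perm & ~n->allowed) {
        return false;
    }
    n->new_perm = perm;
    return true;
}

/* Update every node; either all nodes change or none does. */
static bool update_perms(Node *nodes, size_t count, uint64_t perm)
{
    size_t i, j;

    assert(nodes != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (!node_prepare(&nodes[i], perm)) {
            /* Abort: throw away everything staged so far. */
            for (j = 0; j < i; j++) {
                nodes[j].new_perm = nodes[j].perm;
            }
            return false;
        }
    }

    /*
     * Commit. Between the first prepare and the end of this loop the
     * nodes pass through intermediate states, so a concurrent reader
     * in another thread could observe a half-updated graph.
     */
    for (i = 0; i < count; i++) {
        nodes[i].perm = nodes[i].new_perm;
    }
    return true;
}
```

The window between the first prepare and the last commit is what makes
it hard to call such an update thread-safe with respect to I/O that
checks permissions at the same time.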

The problem with this is that I/O operations rely on permissions, e.g. 
you’ll get assertion failures when trying to write but don’t have the 
WRITE permission.  So it definitely doesn’t seem like something to me 
that can be thread-safe in the sense of cooperating nicely with other 
I/O threads.

Perhaps it’d be fine to do permission updates while the relevant 
subgraph is drained (i.e. blocking all other I/O threads), but I kind of 
feel like the same could be said for all (other) GS operations.  Like, 
you could probably do all kinds of graph changes while all involved 
subgraphs are drained.

Hanna




* Re: [PATCH v4 24/25] job.h: split function pointers in JobDriver
  2021-11-15 15:11   ` Hanna Reitz
@ 2021-11-17 13:43     ` Emanuele Giuseppe Esposito
  2021-11-17 13:44       ` Hanna Reitz
  0 siblings, 1 reply; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-17 13:43 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake



On 15/11/2021 16:11, Hanna Reitz wrote:
> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>> The job API will be handled separately in another serie.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   include/qemu/job.h | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/include/qemu/job.h b/include/qemu/job.h
>> index 6e67b6977f..7e9e59f4b8 100644
>> --- a/include/qemu/job.h
>> +++ b/include/qemu/job.h
>> @@ -169,12 +169,21 @@ typedef struct Job {
>>    * Callbacks and other information about a Job driver.
>>    */
>>   struct JobDriver {
>> +
>> +    /* Fields initialized in struct definition and never changed. */
> 
> Like in patch 19, I’d prefer a slightly more verbose comment that I’d 
> find more easily readable.
> 
>> +
>>       /** Derived Job struct size */
>>       size_t instance_size;
>>       /** Enum describing the operation */
>>       JobType job_type;
>> +    /*
>> +     * Functions run without regard to the BQL and may run in any
> 
> s/and/that/?
> 
>> +     * arbitrary thread. These functions do not need to be thread-safe
>> +     * because the caller ensures that are invoked from one thread at 
>> time.
> 
> s/that/they/ (or “that they”)
> 
> I believe .run() must be run in the job’s context, though.  Not sure if 
> that’s necessary to note, but it isn’t really an arbitrary thread, and 
> block jobs certainly require this (because they run in the block 
> device’s context).  Or is that something that’s going to change with I/O 
> threading?
> 

What about moving .run() before the comment and adding "Must be run in 
the job's context" to its comment description?

Maybe also add the following assertion in job_co_entry (which calls 
job->run())?

assert(job->aio_context == qemu_get_current_aio_context());
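
As a rough, self-contained illustration of that check (not the actual
job.c code), the sketch below uses simplified stand-ins: the AioContext
and Job structs, get_current_aio_context(), set_current_aio_context(),
job_in_own_context(), job_entry() and example_run() are all
hypothetical and only model the idea with a thread-local variable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; the real QEMU structures are far larger. */
typedef struct AioContext {
    const char *name;
} AioContext;

typedef struct Job {
    AioContext *aio_context;        /* context the job belongs to */
    int (*run)(struct Job *job);    /* driver callback */
} Job;

/* Hypothetical model of "the context the current thread is running". */
static _Thread_local AioContext *current_ctx;

static AioContext *get_current_aio_context(void)
{
    return current_ctx;
}

static void set_current_aio_context(AioContext *ctx)
{
    current_ctx = ctx;
}

/* Predicate form of the proposed assertion, so it can be tested. */
static bool job_in_own_context(const Job *job)
{
    return job->aio_context == get_current_aio_context();
}

/* Entry point: refuse to call .run() outside the job's own context. */
static int job_entry(Job *job)
{
    assert(job_in_own_context(job));
    return job->run(job);
}

/* Trivial driver callback used only to exercise job_entry(). */
static int example_run(Job *job)
{
    (void)job;
    return 0;
}
```

Keeping the condition in a separate predicate is only a convenience of
this sketch; it lets the mismatch case be checked without aborting.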

Emanuele




* Re: [PATCH v4 24/25] job.h: split function pointers in JobDriver
  2021-11-17 13:43     ` Emanuele Giuseppe Esposito
@ 2021-11-17 13:44       ` Hanna Reitz
  0 siblings, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-17 13:44 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 17.11.21 14:43, Emanuele Giuseppe Esposito wrote:
>
>
> On 15/11/2021 16:11, Hanna Reitz wrote:
>> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>>> The job API will be handled separately in another serie.
>>>
>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>   include/qemu/job.h | 16 ++++++++++++++++
>>>   1 file changed, 16 insertions(+)
>>>
>>> diff --git a/include/qemu/job.h b/include/qemu/job.h
>>> index 6e67b6977f..7e9e59f4b8 100644
>>> --- a/include/qemu/job.h
>>> +++ b/include/qemu/job.h
>>> @@ -169,12 +169,21 @@ typedef struct Job {
>>>    * Callbacks and other information about a Job driver.
>>>    */
>>>   struct JobDriver {
>>> +
>>> +    /* Fields initialized in struct definition and never changed. */
>>
>> Like in patch 19, I’d prefer a slightly more verbose comment that I’d 
>> find more easily readable.
>>
>>> +
>>>       /** Derived Job struct size */
>>>       size_t instance_size;
>>>       /** Enum describing the operation */
>>>       JobType job_type;
>>> +    /*
>>> +     * Functions run without regard to the BQL and may run in any
>>
>> s/and/that/?
>>
>>> +     * arbitrary thread. These functions do not need to be thread-safe
>>> +     * because the caller ensures that are invoked from one thread 
>>> at time.
>>
>> s/that/they/ (or “that they”)
>>
>> I believe .run() must be run in the job’s context, though.  Not sure 
>> if that’s necessary to note, but it isn’t really an arbitrary thread, 
>> and block jobs certainly require this (because they run in the block 
>> device’s context).  Or is that something that’s going to change with 
>> I/O threading?
>>
>
> What about moving .run() before the comment and add "Must be run in 
> the job's context" to its comment description?

Sure, works for me.

> maybe also add the following assertion in job_co_entry (that calls 
> job->run())?
>
> assert(job->aio_context == qemu_get_current_aio_context());

Sounds good!

Hanna




* Re: [PATCH v4 08/25] block: introduce assert_bdrv_graph_writable
  2021-11-12 14:40   ` Hanna Reitz
@ 2021-11-18  9:55     ` Emanuele Giuseppe Esposito
  2021-11-18 10:24       ` Emanuele Giuseppe Esposito
  2021-11-18 15:17       ` Hanna Reitz
  0 siblings, 2 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-18  9:55 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake


On 12/11/2021 15:40, Hanna Reitz wrote:
> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>> We want to be sure that the functions that write the child and
>> parent list of a bs are under BQL and drain.
>>
>> BQL prevents from concurrent writings from the GS API, while
>> drains protect from I/O.
>>
>> TODO: drains are missing in some functions using this assert.
>> Therefore a proper assertion will fail. Because adding drains
>> requires additional discussions, they will be added in future
>> series.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   block.c                                |  5 +++++
>>   block/io.c                             | 11 +++++++++++
>>   include/block/block_int-global-state.h | 10 +++++++++-
>>   3 files changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/block.c b/block.c
>> index 41c5883c5c..94bff5c757 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -2734,12 +2734,14 @@ static void 
>> bdrv_replace_child_noperm(BdrvChild *child,
>>           if (child->klass->detach) {
>>               child->klass->detach(child);
>>           }
>> +        assert_bdrv_graph_writable(old_bs);
>>           QLIST_REMOVE(child, next_parent);
> 
> I think this belongs above the .detach() call (and the QLIST_REMOVE() 
> belongs into the .detach() implementation, as done in 
> https://lists.nongnu.org/archive/html/qemu-block/2021-11/msg00240.html, 
> which has been merged to Kevin’s block branch).

Yes, I rebased on kwolf/block branch. Thank you for pointing that out.
> 
>>       }
>>       child->bs = new_bs;
>>       if (new_bs) {
>> +        assert_bdrv_graph_writable(new_bs);
>>           QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
> 
> In both these places it’s a bit strange that the assertion is done on 
> the child nodes.  The subgraph starting from them isn’t modified after 
> all, so their subgraph technically doesn’t need to be writable.  I think 
> a single assertion on the parent node would be preferable.
> 
> I presume the problem with that is that we don’t have the parent node 
> here?  Do we need a new BdrvChildClass method that performs this 
> assertion on the parent node?
> 

Uhm I am not sure what you mean here.

Just to recap how I see this: the assertion 
assert_bdrv_graph_writable(bs) is meant to make sure that writes to some 
fields of a given @bs (the children and parents lists in this patch) are 
protected. It should work like a rwlock: reads are allowed to be 
concurrent, but a write must first stop all readers to prevent 
concurrency issues. This is achieved by draining.
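
As a rough illustration of the rwlock analogy, here is a minimal, self-contained C sketch of the check the completed assertion is expected to perform: BQL held and a non-zero quiesce counter. MockBDS, mock_in_main_thread and the mock_* helpers are hypothetical stand-ins invented for this example, not QEMU APIs; the real function would use qemu_in_main_thread() and qatomic_read(&bs->quiesce_counter), as the patch's TODO says.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for BlockDriverState: only the drain counter. */
typedef struct MockBDS {
    int quiesce_counter;   /* > 0 while the node is drained */
} MockBDS;

/* Hypothetical stand-in for qemu_in_main_thread() (i.e. BQL held). */
bool mock_in_main_thread = true;

/* "Write lock taken": BQL excludes other GS writers, drain excludes I/O readers. */
bool mock_graph_writable(const MockBDS *bs)
{
    return mock_in_main_thread && bs->quiesce_counter > 0;
}

/* Simplified drain section: just bump the counter. */
void mock_drained_begin(MockBDS *bs)
{
    bs->quiesce_counter++;
}

void mock_drained_end(MockBDS *bs)
{
    assert(bs->quiesce_counter > 0);
    bs->quiesce_counter--;
}
```

In this model a list update on a node is only legal between mock_drained_begin() and mock_drained_end(), and only from the main thread.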

Let's use the first case that you point out, old_bs (the new_bs case is 
symmetric):

 >> +        assert_bdrv_graph_writable(old_bs);
 >>           QLIST_REMOVE(child, next_parent);

So old_bs should be the child "son" (child->bs), meaning old_bs->parents 
contains the child. Therefore when a child is removed from old_bs->parents, 
we need to be sure we are doing it safely.

So we should check that if old_bs exists, old_bs should be drained, to 
prevent any other iothread from reading the ->parents list that is being 
updated.

The only thing to keep in mind in this case is that just wrapping a 
drain around that won't be enough, because then the child won't be 
included in the drain_end(old_bs). Therefore the right way to cover this 
drain-wise once the assertion also checks for drains is:

bdrv_drained_begin(old_bs);
assert_bdrv_graph_writable(old_bs);
QLIST_REMOVE(child, next_parent);
/* old_bs will be under drain_end, but not the child */
bdrv_parent_drained_end_single(child);
bdrv_drained_end(old_bs);

I think you agree on this so far.

Now I think your concern is related to the child "parent", namely 
child->opaque. The problem is that in the .detach and .attach callbacks 
we first add/remove the child to/from the list, and only then call 
drain on the subtree. We would ideally need to do the opposite:

assert_bdrv_graph_writable(bs);
QLIST_REMOVE(child, next);
bdrv_unapply_subtree_drain(child, bs);

In this case I think this would actually work, because removing/adding 
the child from the ->children list beforehand just prevents an 
additional recursion call (I think, and the fact that tests are passing 
seems to confirm my theory).

Of course you know this stuff better than me, so let me know if 
something here is wrong.

>>           /*
>> @@ -2940,6 +2942,7 @@ static int 
>> bdrv_attach_child_noperm(BlockDriverState *parent_bs,
>>           return ret;
>>       }
>> +    assert_bdrv_graph_writable(parent_bs);
>>       QLIST_INSERT_HEAD(&parent_bs->children, *child, next);
>>       /*
>>        * child is removed in bdrv_attach_child_common_abort(), so 
>> don't care to
>> @@ -3140,6 +3143,7 @@ static void 
>> bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child,
>>   void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child)
>>   {
>>       assert(qemu_in_main_thread());
>> +    assert_bdrv_graph_writable(parent);
> 
> It looks to me like we have this assertion mainly because 
> bdrv_replace_child_noperm() doesn’t have a pointer to this parent node. 
> It’s a workaround, but we should have this in every path that eventually 
> ends up at bdrv_replace_child_noperm(), and that seems rather difficult 
> for the bdrv_replace_node() family of functions. That to me sounds like 
> it’d be good to have this as a BdrvChildClass function.

I think this assertion is wrong. There is no ->children or ->parents 
manipulation here. It used to be in one of the functions that it calls 
internally, but as you pointed out it has now been moved to .attach and 
.detach. So I will remove this.

Not sure about the BdrvChildClass function, feel free to elaborate more 
if what I wrote above is wrong/does not make sense to you.

Thank you,
Emanuele
> 
>>       if (child == NULL) {
>>           return;
>>       }
>> @@ -4903,6 +4907,7 @@ static void 
>> bdrv_remove_filter_or_cow_child_abort(void *opaque)
>>       BdrvRemoveFilterOrCowChild *s = opaque;
>>       BlockDriverState *parent_bs = s->child->opaque;
>> +    assert_bdrv_graph_writable(parent_bs);
>>       QLIST_INSERT_HEAD(&parent_bs->children, s->child, next);
>>       if (s->is_backing) {
>>           parent_bs->backing = s->child;
>> diff --git a/block/io.c b/block/io.c
>> index f271ab3684..1c71e354d6 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -740,6 +740,17 @@ void bdrv_drain_all(void)
>>       bdrv_drain_all_end();
>>   }
>> +void assert_bdrv_graph_writable(BlockDriverState *bs)
>> +{
>> +    /*
>> +     * TODO: this function is incomplete. Because the users of this
>> +     * assert lack the necessary drains, check only for BQL.
>> +     * Once the necessary drains are added,
>> +     * assert also for qatomic_read(&bs->quiesce_counter) > 0
>> +     */
>> +    assert(qemu_in_main_thread());
>> +}
>> +
>>   /**
>>    * Remove an active request from the tracked requests list
>>    *
>> diff --git a/include/block/block_int-global-state.h 
>> b/include/block/block_int-global-state.h
>> index d08e80222c..6bd7746409 100644
>> --- a/include/block/block_int-global-state.h
>> +++ b/include/block/block_int-global-state.h
>> @@ -316,4 +316,12 @@ void 
>> bdrv_remove_aio_context_notifier(BlockDriverState *bs,
>>    */
>>   void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
>> -#endif /* BLOCK_INT_GLOBAL_STATE*/
>> +/**
>> + * Make sure that the function is either running under
>> + * drain and BQL. The latter protects from concurrent writings
> 
> “either ... and” sounds wrong to me.  I’d drop the “either” or say 
> “running under both drain and BQL”.
> 
> Hanna
> 
>> + * from the GS API, while the former prevents concurrent reads
>> + * from I/O.
>> + */
>> +void assert_bdrv_graph_writable(BlockDriverState *bs);
>> +
>> +#endif /* BLOCK_INT_GLOBAL_STATE */
> 




* Re: [PATCH v4 08/25] block: introduce assert_bdrv_graph_writable
  2021-11-18  9:55     ` Emanuele Giuseppe Esposito
@ 2021-11-18 10:24       ` Emanuele Giuseppe Esposito
  2021-11-18 15:17       ` Hanna Reitz
  1 sibling, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-18 10:24 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake



On 18/11/2021 10:55, Emanuele Giuseppe Esposito wrote:
> 
> On 12/11/2021 15:40, Hanna Reitz wrote:
>> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>>> We want to be sure that the functions that write the child and
>>> parent list of a bs are under BQL and drain.
>>>
>>> BQL prevents from concurrent writings from the GS API, while
>>> drains protect from I/O.
>>>
>>> TODO: drains are missing in some functions using this assert.
>>> Therefore a proper assertion will fail. Because adding drains
>>> requires additional discussions, they will be added in future
>>> series.
>>>
>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>   block.c                                |  5 +++++
>>>   block/io.c                             | 11 +++++++++++
>>>   include/block/block_int-global-state.h | 10 +++++++++-
>>>   3 files changed, 25 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index 41c5883c5c..94bff5c757 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -2734,12 +2734,14 @@ static void 
>>> bdrv_replace_child_noperm(BdrvChild *child,
>>>           if (child->klass->detach) {
>>>               child->klass->detach(child);
>>>           }
>>> +        assert_bdrv_graph_writable(old_bs);
>>>           QLIST_REMOVE(child, next_parent);
>>
>> I think this belongs above the .detach() call (and the QLIST_REMOVE() 
>> belongs into the .detach() implementation, as done in 
>> https://lists.nongnu.org/archive/html/qemu-block/2021-11/msg00240.html, which 
>> has been merged to Kevin’s block branch).
> 
> Yes, I rebased on kwolf/block branch. Thank you for pointing that out.
>>
>>>       }
>>>       child->bs = new_bs;
>>>       if (new_bs) {
>>> +        assert_bdrv_graph_writable(new_bs);
>>>           QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
>>
>> In both these places it’s a bit strange that the assertion is done on 
>> the child nodes.  The subgraph starting from them isn’t modified after 
>> all, so their subgraph technically doesn’t need to be writable.  I 
>> think a single assertion on the parent node would be preferable.
>>
>> I presume the problem with that is that we don’t have the parent node 
>> here?  Do we need a new BdrvChildClass method that performs this 
>> assertion on the parent node?
>>
> 
> Uhm I am not sure what you mean here.
> 
> Just to recap on how I see this: the assertion 
> assert_bdrv_graph_writable(bs) is basically used to make sure we are 
> protecting the write on some fields (childrens and parents lists in this 
> patch) of a given @bs. It should work like a rwlock: reading is allowed 
> to be concurrent, but a write should stop all readers to prevent 
> concurrency issues. This is achieved by draining.

I am thinking of adding an additional explanation to the 
assert_bdrv_graph_writable header comment, saying:
"Drains act as a rwlock: while reading is allowed to be concurrent from 
all iothreads, when a write needs to be performed we need to stop 
(drain) all involved iothreads from reading the graph, to avoid race 
conditions."

Something like that.

Emanuele
> 
> Let's use the first case that you point out, old_bs (it's specular for 
> new_bs):
> 
>  >> +        assert_bdrv_graph_writable(old_bs);
>  >>           QLIST_REMOVE(child, next_parent);
> 
> So old_bs should be the child "son" (child->bs), meaning old_bs->parents 
> contains the child. Therefore when a child is removed by old_bs, we need 
> to be sure we are doing it safely.
> 
> So we should check that if old_bs exists, old_bs should be drained, to 
> prevent any other iothread from reading the ->parents list that is being 
> updated.
> 
> The only thing to keep in mind in this case is that just wrapping a 
> drain around that won't be enough, because then the child won't be 
> included in the drain_end(old_bs). Therefore the right way to cover this 
> drain-wise once the assertion also checks for drains is:
> 
> drain_begin(old_bs)
> assert_bdrv_graph_writable(old_bs)
> QLIST_REMOVE(child, next_parent)
> /* old_bs will be under drain_end, but not the child */
> bdrv_parent_drained_end_single(child);
> bdrv_drained_end(old_bs);
> 
> I think you agree on this so far.
> 
> Now I think your concern is related to the child "parent", namely 
> child->opaque. The problem is that in the .detach and .attach callbacks 
> we are firstly adding/removing the child from the list, and then calling 
> drain on the subtree. We would ideally need to do the opposite:
> 
> assert_bdrv_graph_writable(bs);
> QLIST_REMOVE(child, next);
> bdrv_unapply_subtree_drain(child, bs);
> 
> In this case I think this would actually work, because removing/adding 
> the child from the ->children list beforehand just prevents an 
> additional recursion call (I think, and the fact that tests are passing 
> seems to confirm my theory).
> 
> Of course you know this stuff better than me, so let me know if 
> something here is wrong.
> 
>>>           /*
>>> @@ -2940,6 +2942,7 @@ static int 
>>> bdrv_attach_child_noperm(BlockDriverState *parent_bs,
>>>           return ret;
>>>       }
>>> +    assert_bdrv_graph_writable(parent_bs);
>>>       QLIST_INSERT_HEAD(&parent_bs->children, *child, next);
>>>       /*
>>>        * child is removed in bdrv_attach_child_common_abort(), so 
>>> don't care to
>>> @@ -3140,6 +3143,7 @@ static void 
>>> bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child,
>>>   void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child)
>>>   {
>>>       assert(qemu_in_main_thread());
>>> +    assert_bdrv_graph_writable(parent);
>>
>> It looks to me like we have this assertion mainly because 
>> bdrv_replace_child_noperm() doesn’t have a pointer to this parent 
>> node. It’s a workaround, but we should have this in every path that 
>> eventually ends up at bdrv_replace_child_noperm(), and that seems 
>> rather difficult for the bdrv_replace_node() family of functions. That 
>> to me sounds like it’d be good to have this as a BdrvChildClass function.
> 
> I think this assertion is wrong. There is no ->childrens or ->parents 
> manipulation here, it used to be in one of the function that it calls 
> internally, but now as you pointed out is moved to .attach and .detach. 
> So I will remove this.
> 
> Not sure about the BdrvChildClass function, feel free to elaborate more 
> if what I wrote above is wrong/does not make sense to you.
> 
> Thank you,
> Emanuele
>>
>>>       if (child == NULL) {
>>>           return;
>>>       }
>>> @@ -4903,6 +4907,7 @@ static void 
>>> bdrv_remove_filter_or_cow_child_abort(void *opaque)
>>>       BdrvRemoveFilterOrCowChild *s = opaque;
>>>       BlockDriverState *parent_bs = s->child->opaque;
>>> +    assert_bdrv_graph_writable(parent_bs);
>>>       QLIST_INSERT_HEAD(&parent_bs->children, s->child, next);
>>>       if (s->is_backing) {
>>>           parent_bs->backing = s->child;
>>> diff --git a/block/io.c b/block/io.c
>>> index f271ab3684..1c71e354d6 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>> @@ -740,6 +740,17 @@ void bdrv_drain_all(void)
>>>       bdrv_drain_all_end();
>>>   }
>>> +void assert_bdrv_graph_writable(BlockDriverState *bs)
>>> +{
>>> +    /*
>>> +     * TODO: this function is incomplete. Because the users of this
>>> +     * assert lack the necessary drains, check only for BQL.
>>> +     * Once the necessary drains are added,
>>> +     * assert also for qatomic_read(&bs->quiesce_counter) > 0
>>> +     */
>>> +    assert(qemu_in_main_thread());
>>> +}
>>> +
>>>   /**
>>>    * Remove an active request from the tracked requests list
>>>    *
>>> diff --git a/include/block/block_int-global-state.h 
>>> b/include/block/block_int-global-state.h
>>> index d08e80222c..6bd7746409 100644
>>> --- a/include/block/block_int-global-state.h
>>> +++ b/include/block/block_int-global-state.h
>>> @@ -316,4 +316,12 @@ void 
>>> bdrv_remove_aio_context_notifier(BlockDriverState *bs,
>>>    */
>>>   void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
>>> -#endif /* BLOCK_INT_GLOBAL_STATE*/
>>> +/**
>>> + * Make sure that the function is either running under
>>> + * drain and BQL. The latter protects from concurrent writings
>>
>> “either ... and” sounds wrong to me.  I’d drop the “either” or say 
>> “running under both drain and BQL”.
>>
>> Hanna
>>
>>> + * from the GS API, while the former prevents concurrent reads
>>> + * from I/O.
>>> + */
>>> +void assert_bdrv_graph_writable(BlockDriverState *bs);
>>> +
>>> +#endif /* BLOCK_INT_GLOBAL_STATE */
>>




* Re: [PATCH v4 19/25] block_int-common.h: split function pointers in BlockDriver
  2021-11-15 12:00   ` Hanna Reitz
@ 2021-11-18 12:42     ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-18 12:42 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake



On 15/11/2021 13:00, Hanna Reitz wrote:
>> +
>> +    /*
>> +     * I/O API functions. These functions are thread-safe.
>> +     *
>> +     * See include/block/block-io.h for more information about
>> +     * the I/O API.
>> +     */
>> +
>> +    int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
>> +                                       Error **errp);
>> +    int coroutine_fn (*bdrv_co_create_opts)(BlockDriver *drv,
>> +                                            const char *filename,
>> +                                            QemuOpts *opts,
>> +                                            Error **errp);
> 
> Now this is really interesting.  Technically I suppose these should work 
> in any thread, but trying to do so results in:
> 
> $ touch /tmp/iothread-create-test.qcow2
> $ ./qemu-system-x86_64 -object iothread,id=iothr0 -qmp stdio <<EOF
> {"execute": "qmp_capabilities"}
> {"execute":"blockdev-add","arguments":{"node-name":"proto","driver":"file","filename":"/tmp/iothread-create-test.qcow2"}} 
> 
> {"execute":"x-blockdev-set-iothread","arguments":{"node-name":"proto","iothread":"iothr0"}} 
> 
> {"execute":"blockdev-create","arguments":{"job-id":"create","options":{"driver":"qcow2","file":"proto","size":0}}} 
> 
> EOF
> {"QMP": {"version": {"qemu": {"micro": 90, "minor": 1, "major": 6}, 
> "package": "v6.2.0-rc0-40-gd02d5fe5fb-dirty"}, "capabilities": ["oob"]}}
> {"return": {}}
> {"return": {}}
> {"return": {}}
> {"timestamp": {"seconds": 1636973542, "microseconds": 338117}, "event": 
> "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "create"}}
> {"timestamp": {"seconds": 1636973542, "microseconds": 338197}, "event": 
> "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "create"}}
> {"return": {}}
> qemu: qemu_mutex_unlock_impl: Operation not permitted
> [1]    86154 IOT instruction (core dumped)  ./qemu-system-x86_64 -object 
> iothread,id=iothr0 -qmp stdio <<<''
> 
> So something’s fishy and perhaps we should investigate this...  I mean, 
> I can’t really imagine a case where someone would need to run a 
> blockdev-create job in an I/O thread, but right now the interface allows 
> for it.
> 
> And then bdrv_create() is classified as global state, and also 
> bdrv_co_create_opts_simple(), which is supposed to be a drop-in function 
> for this .bdrv_co_create_opts function.  So that can’t work.
> 
> Also, I believe there might have been some functions you classified as 
> GS that are called from .create* implementations.  I accepted that, 
> given the abort I sketched above.  However, if we classify image 
> creation as I/O, then those would need to be re-evaluated. For example, 
> qcow2_co_create_opts() calls bdrv_create_file(), which is a GS function.
> 
> Some of this issues could be addressed by making .bdrv_co_create_opts a 
> GS function and .bdrv_co_create an I/O function.  I believe that would 
> be the ideal split, even though as shown above .bdrv_co_create doesn’t 
> work in an I/O thread, and then you have the issue of probably all 
> format drivers’ .bdrv_co_create implementations calling 
> bdrv_open_blockdev_ref(), which is a GS function.
> 
> (VMDK even calls blk_new_open(), blk_new_with_bs(), and blk_unref(), 
> none of which can ever be I/O functions, I think.)
> 
> I believe in practice the best is to for now classify all create-related 
> functions as GS functions.  This is supported by the fact that 
> qmp_blockdev_create() specifically creates the create job in the main 
> context (with a TODO comment) and requires block drivers to error out 
> when they encounter a node in a different AioContext.
> 

OK, after reviewing this more carefully I agree with you:
- .bdrv_co_create_opts is for sure a GS function. It is called by 
bdrv_create and it is asserted to be under BQL.
- .bdrv_co_create should also be a GS function, and the easiest thing to 
do would be to follow the existing TODO and make sure we cannot run it 
outside the main loop. I think that I will classify it as GS and add the 
BQL assertion to blockdev_create_run, so that if for some reason someone 
tries to do what you did above, it will crash because of the assertion 
and not because of the missing AioContext lock.
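
To show what that plan amounts to, here is a small self-contained C sketch: the job's run function asserts the BQL before calling into the driver's create callback. Everything prefixed with mock_ or Mock is hypothetical and made up for this example; in QEMU the check would be assert(qemu_in_main_thread()) inside blockdev_create_run, whose real signature and body are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for qemu_in_main_thread() (BQL held). */
bool mock_bql_held = true;

/* Hypothetical model of a create job: a driver callback plus its options. */
typedef struct MockCreateJob {
    int (*create)(void *opts);   /* models drv->bdrv_co_create */
    void *opts;
    int ret;
} MockCreateJob;

/* Example callback that always succeeds. */
int mock_create_ok(void *opts)
{
    (void)opts;
    return 0;
}

/* Models the run function with the proposed GS assertion in front. */
int mock_blockdev_create_run(MockCreateJob *job)
{
    /* Fail early and clearly if called outside the main loop. */
    assert(mock_bql_held);
    job->ret = job->create(job->opts);
    return job->ret;
}
```

With the assertion in place, running the job from an iothread would abort at a well-defined point instead of failing later inside a mutex unlock.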

Emanuele




* Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O
  2021-11-15 16:03 ` Hanna Reitz
  2021-11-15 16:11   ` Daniel P. Berrangé
@ 2021-11-18 13:50   ` Paolo Bonzini
  2021-11-18 15:31     ` Hanna Reitz
  2021-11-18 14:04   ` Paolo Bonzini
  2 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2021-11-18 13:50 UTC (permalink / raw)
  To: Hanna Reitz, Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, Eric Blake, Richard Henderson,
	qemu-devel, Markus Armbruster, Stefan Hajnoczi, John Snow,
	Dr. David Alan Gilbert

On 11/15/21 17:03, Hanna Reitz wrote:
> 
> I only really see four solutions for this:
> (1) We somehow make the amend job run in the main context under the BQL 
> and have it prevent all concurrent I/O access (seems bad)
> (2) We can make the permission functions part of the I/O path (seems 
> wrong and probably impossible?)
> (3) We can drop the permissions update and permanently require the 
> permissions that we need when updating keys (I think this might break 
> existing use cases)
> (4) We can acquire the BQL around the permission update call and perhaps 
> that works?
> 
> I don’t know how (4) would work but it’s basically the only reasonable 
> solution I can come up with.  Would this be a way to call a BQL function 
> from an I/O function?

I think that would deadlock:

	main				I/O thread
	--------			-----
	start bdrv_co_amend
					take BQL
	bdrv_drain
	... hangs ...

(2) is definitely wrong.

(3) I have no idea.

Would it be possible or meaningful to do the bdrv_child_refresh_perms in 
qmp_x_blockdev_amend?  It seems that all users need it, and in general 
it seems weird to amend a qcow2 or luks header (and thus the meaning of 
parts of the file) while others can write to the same file.

Paolo




* Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O
  2021-11-15 16:03 ` Hanna Reitz
  2021-11-15 16:11   ` Daniel P. Berrangé
  2021-11-18 13:50   ` Paolo Bonzini
@ 2021-11-18 14:04   ` Paolo Bonzini
  2021-11-18 15:22     ` Hanna Reitz
  2 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2021-11-18 14:04 UTC (permalink / raw)
  To: Hanna Reitz, Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, Eric Blake, Richard Henderson,
	qemu-devel, Markus Armbruster, Stefan Hajnoczi, John Snow,
	Dr. David Alan Gilbert

On 11/15/21 17:03, Hanna Reitz wrote:
> and second fuse_do_truncate(), which calls blk_set_perm().

Here it seems that a non-growable export is still growable as long as 
nobody is watching. :)  Is this the desired behavior?

Paolo




* Re: [PATCH v4 08/25] block: introduce assert_bdrv_graph_writable
  2021-11-18  9:55     ` Emanuele Giuseppe Esposito
  2021-11-18 10:24       ` Emanuele Giuseppe Esposito
@ 2021-11-18 15:17       ` Hanna Reitz
  2021-11-19  8:55         ` Emanuele Giuseppe Esposito
  1 sibling, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-18 15:17 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake

On 18.11.21 10:55, Emanuele Giuseppe Esposito wrote:
>
> On 12/11/2021 15:40, Hanna Reitz wrote:
>> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>>> We want to be sure that the functions that write the child and
>>> parent list of a bs are under BQL and drain.
>>>
>>> BQL prevents from concurrent writings from the GS API, while
>>> drains protect from I/O.
>>>
>>> TODO: drains are missing in some functions using this assert.
>>> Therefore a proper assertion will fail. Because adding drains
>>> requires additional discussions, they will be added in future
>>> series.
>>>
>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>   block.c                                |  5 +++++
>>>   block/io.c                             | 11 +++++++++++
>>>   include/block/block_int-global-state.h | 10 +++++++++-
>>>   3 files changed, 25 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index 41c5883c5c..94bff5c757 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -2734,12 +2734,14 @@ static void 
>>> bdrv_replace_child_noperm(BdrvChild *child,
>>>           if (child->klass->detach) {
>>>               child->klass->detach(child);
>>>           }
>>> +        assert_bdrv_graph_writable(old_bs);
>>>           QLIST_REMOVE(child, next_parent);
>>
>> I think this belongs above the .detach() call (and the QLIST_REMOVE() 
>> belongs into the .detach() implementation, as done in 
>> https://lists.nongnu.org/archive/html/qemu-block/2021-11/msg00240.html, 
>> which has been merged to Kevin’s block branch).
>
> Yes, I rebased on kwolf/block branch. Thank you for pointing that out.
>>
>>>       }
>>>       child->bs = new_bs;
>>>       if (new_bs) {
>>> +        assert_bdrv_graph_writable(new_bs);
>>>           QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
>>
>> In both these places it’s a bit strange that the assertion is done on 
>> the child nodes.  The subgraph starting from them isn’t modified 
>> after all, so their subgraph technically doesn’t need to be 
>> writable.  I think a single assertion on the parent node would be 
>> preferable.
>>
>> I presume the problem with that is that we don’t have the parent node 
>> here?  Do we need a new BdrvChildClass method that performs this 
>> assertion on the parent node?
>>
>
> Uhm I am not sure what you mean here.
>
> Just to recap on how I see this: the assertion 
> assert_bdrv_graph_writable(bs) is basically used to make sure we are 
> protecting the write on some fields (childrens and parents lists in 
> this patch) of a given @bs.

Oh, OK.  I understood it to mean that the subgraph starting at `bs` is 
mutable, i.e. including all of its recursive children.

And yes, you’re right, the child BDSs are indeed modified, too, so we 
should check them, too.

> It should work like a rwlock: reading is allowed to be concurrent, but 
> a write should stop all readers to prevent concurrency issues. This is 
> achieved by draining.

Draining works on a subgraph, so I suppose it does mean that the whole 
subgraph will be mutable.  But no matter, at least new_bs is not in the 
parent’s subgraph, so it wouldn’t be included in the check if we only 
checked the parent.

> Let's use the first case that you point out, old_bs (it's specular for 
> new_bs):
>
> >> +        assert_bdrv_graph_writable(old_bs);
> >>           QLIST_REMOVE(child, next_parent);
>
> So old_bs should be the child "son" (child->bs), meaning 
> old_bs->parents contains the child. Therefore when a child is removed 
> by old_bs, we need to be sure we are doing it safely.
>
> So we should check that if old_bs exists, old_bs should be drained, to 
> prevent any other iothread from reading the ->parents list that is 
> being updated.
>
> The only thing to keep in mind in this case is that just wrapping a 
> drain around that won't be enough, because then the child won't be 
> included in the drain_end(old_bs). Therefore the right way to cover 
> this drain-wise once the assertion also checks for drains is:
>
> drain_begin(old_bs)
> assert_bdrv_graph_writable(old_bs)
> QLIST_REMOVE(child, next_parent)
> /* old_bs will be under drain_end, but not the child */
> bdrv_parent_drained_end_single(child);
> bdrv_drained_end(old_bs);
>
> I think you agree on this so far.
>
> Now I think your concern is related to the child "parent", namely 
> child->opaque. The problem is that in the .detach and .attach 
> callbacks we are firstly adding/removing the child from the list, and 
> then calling drain on the subtree.

It was my impression that you’d want bdrv_replace_child_noperm() to 
always be called in a section where the subgraph starting from the 
parent BDS is drained, not just the child BDSs that are swapped (old_bs 
and new_bs).

My abstract concern is that bdrv_replace_child_noperm() does not modify 
only old_bs and new_bs, but actually modifies the whole subgraph 
starting at the parent node.  A concrete example for this is that we 
modify not only the children’s parent lists, but also the parent’s 
children list.

That’s why I’m asking why we only check that the graph is writable for 
the children, when actually I feel like we’re modifying the parent, too.

> We would ideally need to do the opposite:
>
> assert_bdrv_graph_writable(bs);
> QLIST_REMOVE(child, next);
> bdrv_unapply_subtree_drain(child, bs);
>
> In this case I think this would actually work, because removing/adding 
> the child from the ->children list beforehand just prevents an 
> additional recursion call (I think, and the fact that tests are 
> passing seems to confirm my theory).
>
> Of course you know this stuff better than me, so let me know if 
> something here is wrong.

Well.  I’m mostly wondering why you’re discussing how to do the drain 
right when I was mostly curious about why we check the children and not 
the parent for whether the graph is mutable at their respective 
position. O:)

It was my impression that so far we mostly wrapped graph change 
operations in drained sections (starting at the parent) rather than 
leaving it to bdrv_replace_child_noperm() to do so.  That function only 
deals with drain stuff because it balances the drain counters on old_bs 
and new_bs to match the counter on the parent’s subgraph.
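
A simplified C model of that counter balancing might look like the sketch below. It is only an assumption about the general idea (the old child loses the drains it inherited from the parent's subtree, the new child gains them) and not the actual bdrv_replace_child_noperm() logic. MockNode, MockChild and mock_replace_child() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node: total drain count plus subtree drains started here. */
typedef struct MockNode {
    int quiesce_counter;            /* all drains currently affecting the node */
    int recursive_quiesce_counter;  /* subtree drains begun at this node */
} MockNode;

/* Hypothetical edge from a parent node to its current child node. */
typedef struct MockChild {
    MockNode *parent;
    MockNode *bs;
} MockChild;

/* Swap the child node and move the parent's subtree drains along with it. */
void mock_replace_child(MockChild *child, MockNode *new_bs)
{
    MockNode *old_bs = child->bs;
    int inherited = child->parent->recursive_quiesce_counter;

    if (old_bs) {
        /* old_bs leaves the parent's subtree: drop the inherited drains. */
        assert(old_bs->quiesce_counter >= inherited);
        old_bs->quiesce_counter -= inherited;
    }
    child->bs = new_bs;
    if (new_bs) {
        /* new_bs joins the parent's subtree: it must be drained as often. */
        new_bs->quiesce_counter += inherited;
    }
}
```

Note that the function itself never starts a drained section; it only keeps the counters consistent with whatever sections the caller already opened on the parent.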

Hanna




* Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O
  2021-11-18 14:04   ` Paolo Bonzini
@ 2021-11-18 15:22     ` Hanna Reitz
  0 siblings, 0 replies; 86+ messages in thread
From: Hanna Reitz @ 2021-11-18 15:22 UTC (permalink / raw)
  To: Paolo Bonzini, Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, Eric Blake, Richard Henderson,
	qemu-devel, Markus Armbruster, Stefan Hajnoczi, John Snow,
	Dr. David Alan Gilbert

On 18.11.21 15:04, Paolo Bonzini wrote:
> On 11/15/21 17:03, Hanna Reitz wrote:
>> and second fuse_do_truncate(), which calls blk_set_perm().
>
> Here it seems that a non-growable export is still growable as long as 
> nobody is watching. :)  Is this the desired behavior?

Yes, absolutely.  “Growable” is documented to mean that writes after the 
end of the exported file will grow it to fit.  Explicit truncating is 
something else, and I believe we should allow it on all writable 
exports.  (Of course only when other potential users of the block node 
in question allow it to be resized, but that’s what the permission is for.)
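
The distinction can be sketched in a few lines of self-contained C. This is an illustrative model only: MockExport, its fields, and the errno values chosen are assumptions made for the example and do not mirror the FUSE export code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an export; field names are illustrative. */
typedef struct MockExport {
    bool writable;
    bool growable;               /* writes past EOF extend the file */
    bool resize_perm_available;  /* other users of the node allow resizing */
    int64_t size;
} MockExport;

/* A write past the end only succeeds on a growable export. */
int mock_export_write(MockExport *e, int64_t offset, int64_t len)
{
    if (!e->writable) {
        return -EACCES;
    }
    if (offset + len > e->size) {
        if (!e->growable) {
            return -ENOSPC;      /* assumed error code for this sketch */
        }
        e->size = offset + len;
    }
    return 0;
}

/* Explicit truncation ignores "growable" and depends on the permission. */
int mock_export_truncate(MockExport *e, int64_t new_size)
{
    if (!e->writable) {
        return -EACCES;
    }
    if (!e->resize_perm_available) {
        return -EPERM;           /* assumed error code for this sketch */
    }
    e->size = new_size;
    return 0;
}
```

In this model a writable, non-growable export rejects a write beyond its end but still accepts an explicit truncate, as long as the resize permission can be obtained.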

Hanna




* Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O
  2021-11-18 13:50   ` Paolo Bonzini
@ 2021-11-18 15:31     ` Hanna Reitz
  2021-11-19  3:13       ` Paolo Bonzini
  0 siblings, 1 reply; 86+ messages in thread
From: Hanna Reitz @ 2021-11-18 15:31 UTC (permalink / raw)
  To: Paolo Bonzini, Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, Eric Blake, Richard Henderson,
	qemu-devel, Markus Armbruster, Stefan Hajnoczi, John Snow,
	Dr. David Alan Gilbert

On 18.11.21 14:50, Paolo Bonzini wrote:
> On 11/15/21 17:03, Hanna Reitz wrote:
>>
>> I only really see four solutions for this:
>> (1) We somehow make the amend job run in the main context under the 
>> BQL and have it prevent all concurrent I/O access (seems bad)
>> (2) We can make the permission functions part of the I/O path (seems 
>> wrong and probably impossible?)
>> (3) We can drop the permissions update and permanently require the 
>> permissions that we need when updating keys (I think this might break 
>> existing use cases)
>> (4) We can acquire the BQL around the permission update call and 
>> perhaps that works?
>>
>> I don’t know how (4) would work but it’s basically the only 
>> reasonable solution I can come up with.  Would this be a way to call 
>> a BQL function from an I/O function?
>
> I think that would deadlock:
>
>     main                I/O thread
>     --------            -----
>     start bdrv_co_amend
>                     take BQL
>     bdrv_drain
>     ... hangs ...

:/

Is there really nothing we can do?  Forgive me if I’m talking complete 
nonsense here (because frankly I don’t even really know what a bottom 
half is exactly), but can’t we schedule some coroutine in the main 
thread to do the perm notifications and wait for them in the I/O thread?

> (2) is definitely wrong.
>
> (3) I have no idea.
>
> Would it be possible or meaningful to do the bdrv_child_refresh_perms 
> in qmp_x_blockdev_amend?  It seems that all users need it, and in 
> general it seems weird to amend a qcow2 or luks header (and thus the 
> meaning of parts of the file) while others can write to the same file.

Hmm...  Perhaps.  We would need to undo the permission change when the 
job finishes, though, i.e. in JobDriver.prepare() or JobDriver.clean().  
Doing the change in qmp_x_blockdev_amend() would be asymmetric then, so 
we’d probably want a new JobDriver method that runs in the main thread 
before .run() is invoked. (Unfortunately, “.prepare()” is now taken 
already...)

Doesn’t solve the FUSE problem, but there we could try to just take the 
RESIZE permission permanently and if that fails, we just don’t allow 
truncates for that export.  Not nice, but should work for common cases.
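
A minimal sketch of that fallback, as a toy model only: the struct, the function names and the -EPERM return value are made up for illustration and are not the FUSE export code. It assumes the RESIZE permission is requested once at export setup and the result is remembered.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy export state; not the real FUSE export structure. */
struct toy_export {
    bool writable;
    bool allow_truncate;  /* true only if RESIZE was granted at setup */
};

/*
 * resize_granted stands in for the result of requesting the RESIZE
 * permission once, when the export is created (hypothetical parameter).
 */
static void toy_export_setup(struct toy_export *exp, bool writable,
                             bool resize_granted)
{
    exp->writable = writable;
    exp->allow_truncate = writable && resize_granted;
}

/* Returns 0 if the truncate may proceed, a negative errno otherwise. */
static int toy_export_truncate(const struct toy_export *exp)
{
    return exp->allow_truncate ? 0 : -EPERM;
}
```

With this shape, no permission update is needed at truncate time, so nothing on the I/O path has to call a global state function.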

Hanna



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O
  2021-11-18 15:31     ` Hanna Reitz
@ 2021-11-19  3:13       ` Paolo Bonzini
  2021-11-19 10:42         ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 86+ messages in thread
From: Paolo Bonzini @ 2021-11-19  3:13 UTC (permalink / raw)
  To: Hanna Reitz
  Cc: Emanuele Giuseppe Esposito, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy, Daniel P. Berrangé,
	Eduardo Habkost, open list:Block layer core, Juan Quintela,
	Eric Blake, Richard Henderson, qemu-devel, Markus Armbruster,
	Stefan Hajnoczi, Fam Zheng, John Snow, Dr. David Alan Gilbert

On Thu, 18 Nov 2021 at 16:31, Hanna Reitz <hreitz@redhat.com> wrote:

> On 18.11.21 14:50, Paolo Bonzini wrote:
> > On 11/15/21 17:03, Hanna Reitz wrote:
> >>
> >> I only really see four solutions for this:
> >> (1) We somehow make the amend job run in the main context under the
> >> BQL and have it prevent all concurrent I/O access (seems bad)
> >> (2) We can make the permission functions part of the I/O path (seems
> >> wrong and probably impossible?)
> >> (3) We can drop the permissions update and permanently require the
> >> permissions that we need when updating keys (I think this might break
> >> existing use cases)
> >> (4) We can acquire the BQL around the permission update call and
> >> perhaps that works?
> >>
> >> I don’t know how (4) would work but it’s basically the only
> >> reasonable solution I can come up with.  Would this be a way to call
> >> a BQL function from an I/O function?
> >
> > I think that would deadlock:
> >
> >     main                I/O thread
> >     --------            -----
> >     start bdrv_co_amend
> >                     take BQL
> >     bdrv_drain
> >     ... hangs ...
>
> :/
>
> Is there really nothing we can do?  Forgive me if I’m talking complete
> nonsense here (because frankly I don’t even really know what a bottom
> half is exactly), but can’t we schedule some coroutine in the main
> thread to do the perm notifications and wait for them in the I/O thread?
>

I think you still get a deadlock, just one with a longer chain. You still
have a cycle of things depending on each other, but one of them is now the
I/O thread waiting for the bottom half.
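
The cycle can be sketched as a toy wait-for graph. This is only an illustration: the actor names and the helper are invented here, and the edges (for example, that the bottom half cannot make progress while the main thread sits in the drain) are assumptions fed into the model, not something it proves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical actors in the scenario; none of these are QEMU identifiers. */
enum actor { MAIN_THREAD, IO_THREAD, BOTTOM_HALF, NUM_ACTORS };

#define WAITS_FOR_NOTHING (-1)

/*
 * waits_for[x] names the single actor that x is blocked on, or
 * WAITS_FOR_NOTHING. Follow the chain from start and report whether
 * it leads back to start, i.e. whether start is part of a deadlock.
 */
static bool wait_chain_is_cyclic(const int waits_for[NUM_ACTORS], int start)
{
    int cur;

    assert(start >= 0 && start < NUM_ACTORS);
    cur = waits_for[start];
    for (size_t steps = 0;
         steps < NUM_ACTORS && cur != WAITS_FOR_NOTHING;
         steps++) {
        if (cur == start) {
            return true;
        }
        cur = waits_for[cur];
    }
    return false;
}
```

Taking the BQL directly gives the two-node cycle main -> I/O thread -> main; going through a bottom half only inserts one more node into the same cycle.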

> Hmm...  Perhaps.  We would need to undo the permission change when the
> job finishes, though, i.e. in JobDriver.prepare() or JobDriver.clean().
> Doing the change in qmp_x_blockdev_amend() would be asymmetric then, so
> we’d probably want a new JobDriver method that runs in the main thread
> before .run() is invoked. (Unfortunately, “.prepare()” is now taken
> already...)
>

Ok at least it's feasible.

> Doesn’t solve the FUSE problem, but there we could try to just take the
> RESIZE permission permanently and if that fails, we just don’t allow
> truncates for that export.  Not nice, but should work for common cases.
>

Yeah, definitely not nice. Permissions could probably be protected by their
own mutex, even a global one like the one we have for jobs. For now I
suggest just ignoring the problem and adding a comment, since it is not
really a new problem.

Paolo

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 08/25] block: introduce assert_bdrv_graph_writable
  2021-11-18 15:17       ` Hanna Reitz
@ 2021-11-19  8:55         ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-19  8:55 UTC (permalink / raw)
  To: Hanna Reitz, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, Juan Quintela, qemu-devel, John Snow,
	Richard Henderson, Markus Armbruster, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, Eric Blake



On 18/11/2021 16:17, Hanna Reitz wrote:
> On 18.11.21 10:55, Emanuele Giuseppe Esposito wrote:
>>
>> On 12/11/2021 15:40, Hanna Reitz wrote:
>>> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
>>>> We want to be sure that the functions that write the child and
>>>> parent list of a bs are under BQL and drain.
>>>>
>>>> BQL prevents from concurrent writings from the GS API, while
>>>> drains protect from I/O.
>>>>
>>>> TODO: drains are missing in some functions using this assert.
>>>> Therefore a proper assertion will fail. Because adding drains
>>>> requires additional discussions, they will be added in future
>>>> series.
>>>>
>>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>> ---
>>>>   block.c                                |  5 +++++
>>>>   block/io.c                             | 11 +++++++++++
>>>>   include/block/block_int-global-state.h | 10 +++++++++-
>>>>   3 files changed, 25 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/block.c b/block.c
>>>> index 41c5883c5c..94bff5c757 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -2734,12 +2734,14 @@ static void 
>>>> bdrv_replace_child_noperm(BdrvChild *child,
>>>>           if (child->klass->detach) {
>>>>               child->klass->detach(child);
>>>>           }
>>>> +        assert_bdrv_graph_writable(old_bs);
>>>>           QLIST_REMOVE(child, next_parent);
>>>
>>> I think this belongs above the .detach() call (and the QLIST_REMOVE() 
>>> belongs into the .detach() implementation, as done in 
>>> https://lists.nongnu.org/archive/html/qemu-block/2021-11/msg00240.html, 
>>> which has been merged to Kevin’s block branch).
>>
>> Yes, I rebased on kwolf/block branch. Thank you for pointing that out.
>>>
>>>>       }
>>>>       child->bs = new_bs;
>>>>       if (new_bs) {
>>>> +        assert_bdrv_graph_writable(new_bs);
>>>>           QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
>>>
>>> In both these places it’s a bit strange that the assertion is done on 
>>> the child nodes.  The subgraph starting from them isn’t modified 
>>> after all, so their subgraph technically doesn’t need to be 
>>> writable.  I think a single assertion on the parent node would be 
>>> preferable.
>>>
>>> I presume the problem with that is that we don’t have the parent node 
>>> here?  Do we need a new BdrvChildClass method that performs this 
>>> assertion on the parent node?
>>>
>>
>> Uhm I am not sure what you mean here.
>>
>> Just to recap on how I see this: the assertion 
>> assert_bdrv_graph_writable(bs) is basically used to make sure we are 
>> protecting the write on some fields (childrens and parents lists in 
>> this patch) of a given @bs.
> 
> Oh, OK.  I understood it to mean that the subgraph starting at `bs` is 
> mutable, i.e. including all of its recursive children.

Yes, sorry for the confusion. We want to drain the subgraph starting at 
`bs`, also because we modify both the parent's .children list and the 
children's .parents list.

> 
> And yes, you’re right, the child BDSs are indeed modified, too, so we 
> should check them, too.
> 
>> It should work like a rwlock: reading is allowed to be concurrent, but 
>> a write should stop all readers to prevent concurrency issues. This is 
>> achieved by draining.
> 
> Draining works on a subgraph, so I suppose it does mean that the whole 
> subgraph will be mutable.  But no matter, at least new_bs is not in the 
> parent’s subgraph, so it wouldn’t be included in the check if we only 
> checked the parent.
> 
>> Let's use the first case that you point out, old_bs (it's specular for 
>> new_bs):
>>
>> >> +        assert_bdrv_graph_writable(old_bs);
>> >>           QLIST_REMOVE(child, next_parent);
>>
>> So old_bs should be the child "son" (child->bs), meaning 
>> old_bs->parents contains the child. Therefore when a child is removed 
>> by old_bs, we need to be sure we are doing it safely.
>>
>> So we should check that if old_bs exists, old_bs should be drained, to 
>> prevent any other iothread from reading the ->parents list that is 
>> being updated.
>>
>> The only thing to keep in mind in this case is that just wrapping a 
>> drain around that won't be enough, because then the child won't be 
>> included in the drain_end(old_bs). Therefore the right way to cover 
>> this drain-wise once the assertion also checks for drains is:
>>
>> drain_begin(old_bs)
>> assert_bdrv_graph_writable(old_bs)
>> QLIST_REMOVE(child, next_parent)
>> /* old_bs will be under drain_end, but not the child */
>> bdrv_parent_drained_end_single(child);
>> bdrv_drained_end(old_bs);
>>
>> I think you agree on this so far.
>>
>> Now I think your concern is related to the child "parent", namely 
>> child->opaque. The problem is that in the .detach and .attach 
>> callbacks we are firstly adding/removing the child from the list, and 
>> then calling drain on the subtree.
> 
> It was my impression that you’d want bdrv_replace_child_noperm() to 
> always be called in a section where the subgraph starting from the 
> parent BDS is drained, not just the child BDSs that are swapped (old_bs 
> and new_bs).
> 
> My abstract concern is that bdrv_replace_child_noperm() does not modify 
> only old_bs and new_bs, but actually modifies the whole subgraph 
> starting at the parent node.  A concrete example for this is that we 
> modify not only the children’s parent lists, but also the parent’s 
> children list.
> 
> That’s why I’m asking why we only check that the graph is writable for 
> the children, when actually I feel like we’re modifying the parent, too.

Ok, I think I understand what you mean. I believe I addressed this in 
the new version of the series, but it is not addressed here.

Maybe what I do is redundant, but I:
1) drain the children when we swap them
2) modify .attach and .detach to drain child->opaque (the parent) too.
More precisely, I think it should be enough to change 
bdrv_child_cb_attach/detach in the way I showed below, since that is the 
only place where we touch the parent's .children list.
blk_root_attach/detach do not seem to deal with children, so a drain 
there is not necessary.

This should cover everything we need.
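
To make the intended rule concrete, here is a toy model of the check (not the patch itself; every name below is hypothetical and the real assertion may look different). It assumes "writable" means: we are in the main thread under the BQL, and the node is inside a drained section.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a block node; not QEMU's BlockDriverState. */
struct toy_node {
    int drain_count;  /* > 0 while the node is inside a drained section */
};

/* Toy stand-in for "we are in the main thread and hold the BQL". */
static bool toy_in_main_thread = true;

static void toy_drained_begin(struct toy_node *n)
{
    n->drain_count++;
}

static void toy_drained_end(struct toy_node *n)
{
    assert(n->drain_count > 0);
    n->drain_count--;
}

/*
 * Writer-side check, in the spirit of an rwlock: the BQL excludes
 * other global state writers, the drain excludes I/O readers.
 */
static bool toy_graph_writable(const struct toy_node *n)
{
    return toy_in_main_thread && n->drain_count > 0;
}
```

In this model both the parent and the swapped children have to be drained before their lists are touched, which is what points 1) and 2) above aim for.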

> 
>> We would ideally need to do the opposite:
>>
>> assert_bdrv_graph_writable(bs);
>> QLIST_REMOVE(child, next);
>> bdrv_unapply_subtree_drain(child, bs);
>>
>> In this case I think this would actually work, because removing/adding 
>> the child from the ->children list beforehand just prevents an 
>> additional recursion call (I think, and the fact that tests are 
>> passing seems to confirm my theory).
>>
>> Of course you know this stuff better than me, so let me know if 
>> something here is wrong.
> 
> Well.  I’m mostly wondering why you’re discussing how to do the drain 
> right when I was mostly curious about why we check the children and not 
> the parent for whether the graph is mutable at their respective 
> position. O:)
> 
> It was my impression that so far we mostly wrapped graph change 
> operations in drained sections (starting at the parent) and not leave it 
> to bdrv_replace_child_noperm() to do so.  That function only deals with 
> drain stuff because it balances the drain counters on old_bs and new_bs 
> to match the counter on the parent’s subgraph.

"mostly" is the problem, because if you try to use the full assertion 
introduced in this patch (i.e. also asserting on drains), it will fail.

So I figured I would (will) put the drains as near as possible to the 
section where they are needed, since bdrv_replace_child_noperm is being 
called in many different ways. But this is a discussion for a future series.

Emanuele



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v4 00/25] block layer: split block APIs in global state and I/O
  2021-11-19  3:13       ` Paolo Bonzini
@ 2021-11-19 10:42         ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 86+ messages in thread
From: Emanuele Giuseppe Esposito @ 2021-11-19 10:42 UTC (permalink / raw)
  To: Paolo Bonzini, Hanna Reitz
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Daniel P. Berrangé,
	Eduardo Habkost, open list:Block layer core, Juan Quintela,
	Eric Blake, Richard Henderson, qemu-devel, Markus Armbruster,
	Stefan Hajnoczi, John Snow, Dr. David Alan Gilbert



On 19/11/2021 04:13, Paolo Bonzini wrote:
> 
> 
> On Thu, 18 Nov 2021 at 16:31, Hanna Reitz <hreitz@redhat.com> wrote:
> 
>     On 18.11.21 14:50, Paolo Bonzini wrote:
>      > On 11/15/21 17:03, Hanna Reitz wrote:
>      >>
>      >> I only really see four solutions for this:
>      >> (1) We somehow make the amend job run in the main context under the
>      >> BQL and have it prevent all concurrent I/O access (seems bad)
>      >> (2) We can make the permission functions part of the I/O path
>     (seems
>      >> wrong and probably impossible?)
>      >> (3) We can drop the permissions update and permanently require the
>      >> permissions that we need when updating keys (I think this might
>     break
>      >> existing use cases)
>      >> (4) We can acquire the BQL around the permission update call and
>      >> perhaps that works?
>      >>
>      >> I don’t know how (4) would work but it’s basically the only
>      >> reasonable solution I can come up with.  Would this be a way to
>     call
>      >> a BQL function from an I/O function?
>      >
>      > I think that would deadlock:
>      >
>      >     main                I/O thread
>      >     --------            -----
>      >     start bdrv_co_amend
>      >                     take BQL
>      >     bdrv_drain
>      >     ... hangs ...
> 
>     :/
> 
>     Is there really nothing we can do?  Forgive me if I’m talking complete
>     nonsense here (because frankly I don’t even really know what a bottom
>     half is exactly), but can’t we schedule some coroutine in the main
>     thread to do the perm notifications and wait for them in the I/O thread?
> 
> 
> I think you still get a deadlock, just one with a longer chain. You 
> still have a cycle of things depending on each other, but one of them is 
> now the I/O thread waiting for the bottom half.
> 
>     Hmm...  Perhaps.  We would need to undo the permission change when the
>     job finishes, though, i.e. in JobDriver.prepare() or JobDriver.clean().
>     Doing the change in qmp_x_blockdev_amend() would be asymmetric then, so
>     we’d probably want a new JobDriver method that runs in the main thread
>     before .run() is invoked. (Unfortunately, “.prepare()” is now taken
>     already...)
> 
> 
> Ok at least it's feasible.

Ok I think I got it. I will create a new callback, maybe "pre_run" or 
something like that to perform the first bdrv_child_refresh_perms and 
implement the .clean callback to perform the "cleanup" 
bdrv_child_refresh_perms in block_crypto_amend_options_generic_luks.
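
Roughly, as a toy model of the callback order (this is not the real JobDriver; "pre_run" is only a candidate name, and the boolean stands in for the permission refresh):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_job;

/* Toy driver table; "pre_run" is the hypothetical new callback. */
struct toy_job_driver {
    void (*pre_run)(struct toy_job *job);  /* main thread, before run */
    int (*run)(struct toy_job *job);       /* may run outside the main thread */
    void (*clean)(struct toy_job *job);    /* main thread, after the job ends */
};

struct toy_job {
    const struct toy_job_driver *driver;
    bool has_extra_perms;  /* stands in for the refreshed permissions */
    bool ran_with_perms;   /* records what run() observed */
};

static void toy_amend_pre_run(struct toy_job *job)
{
    job->has_extra_perms = true;   /* first permission refresh */
}

static int toy_amend_run(struct toy_job *job)
{
    job->ran_with_perms = job->has_extra_perms;
    return 0;
}

static void toy_amend_clean(struct toy_job *job)
{
    job->has_extra_perms = false;  /* "cleanup" permission refresh */
}

static const struct toy_job_driver toy_amend_driver = {
    .pre_run = toy_amend_pre_run,
    .run = toy_amend_run,
    .clean = toy_amend_clean,
};

/* Calls the callbacks in the proposed order: pre_run, run, clean. */
static int toy_job_execute(struct toy_job *job)
{
    int ret;

    if (job->driver->pre_run) {
        job->driver->pre_run(job);
    }
    ret = job->driver->run(job);
    if (job->driver->clean) {
        job->driver->clean(job);
    }
    return ret;
}
```

The point of the split is that run() never changes permissions itself; both changes happen in callbacks that run in the main thread.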

> 
>     Doesn’t solve the FUSE problem, but there we could try to just take the
>     RESIZE permission permanently and if that fails, we just don’t allow
>     truncates for that export.  Not nice, but should work for common cases.
> 
> 
> Yeah definitely not nice. Probably permissions could be protected by 
> their own mutex, even a global one like the one we have for jobs. For 
> now I suggest just ignoring the problem and adding a comment, since it's 
> not really something that didn't exist.
> 

I will add a TODO in blk_set/get permissions explaining the issue.
The last issue we had with regard to permissions in GS had to do with 
bdrv_co_invalidate_cache. Paolo suggested a simple fix: assert that the 
function either runs under the BQL or does not have 
open_flags & BDRV_O_INACTIVE set. This effectively skips the permission 
code block, entering it only if we have the BQL.
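
As a toy sketch of that assertion (hypothetical names throughout; TOY_O_INACTIVE is a placeholder bit, not the real BDRV_O_INACTIVE value, and the real function does much more):

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_O_INACTIVE 0x0800  /* placeholder flag bit for this sketch */

/* Toy stand-in for a block node; not QEMU's BlockDriverState. */
struct toy_bs {
    int open_flags;
    bool perms_refreshed;  /* set when the permission block was entered */
};

/* The condition to assert: under the BQL, or the node is already active. */
static bool toy_invalidate_allowed(bool in_main_thread, int open_flags)
{
    return in_main_thread || !(open_flags & TOY_O_INACTIVE);
}

static void toy_invalidate_cache(struct toy_bs *bs, bool in_main_thread)
{
    assert(toy_invalidate_allowed(in_main_thread, bs->open_flags));

    if (bs->open_flags & TOY_O_INACTIVE) {
        /* Only reachable with the BQL held, given the assertion above. */
        bs->open_flags &= ~TOY_O_INACTIVE;
        bs->perms_refreshed = true;
    }
}
```

A caller outside the main thread can therefore only reach the function on an already active node, where the permission block is skipped.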


Ok, apart from this permissions issue and assert_bdrv_graph_writable, I 
should have addressed all the main comments on this series. For the 
comments I did not explicitly answer, assume that I agree and have 
applied them.

Thank you,
Emanuele



^ permalink raw reply	[flat|nested] 86+ messages in thread

end of thread, other threads:[~2021-11-19 10:44 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-25 10:17 [PATCH v4 00/25] block layer: split block APIs in global state and I/O Emanuele Giuseppe Esposito
2021-10-25 10:17 ` [PATCH v4 01/25] main-loop.h: introduce qemu_in_main_thread() Emanuele Giuseppe Esposito
2021-10-25 11:33   ` Philippe Mathieu-Daudé
2021-10-25 10:17 ` [PATCH v4 02/25] include/block/block: split header into I/O and global state API Emanuele Giuseppe Esposito
2021-10-25 11:37   ` Philippe Mathieu-Daudé
2021-10-25 12:22     ` Emanuele Giuseppe Esposito
2021-11-11 15:00   ` Hanna Reitz
2021-11-15 12:08     ` Emanuele Giuseppe Esposito
2021-11-12 12:25   ` Hanna Reitz
2021-11-16 14:00     ` Emanuele Giuseppe Esposito
2021-10-25 10:17 ` [PATCH v4 03/25] assertions for block " Emanuele Giuseppe Esposito
2021-11-11 16:32   ` Hanna Reitz
2021-11-15 12:27     ` Emanuele Giuseppe Esposito
2021-11-15 15:27       ` Hanna Reitz
2021-11-12 11:31   ` Hanna Reitz
2021-10-25 10:17 ` [PATCH v4 04/25] include/sysemu/block-backend: split header into I/O and global state (GS) API Emanuele Giuseppe Esposito
2021-11-12 10:23   ` Hanna Reitz
2021-11-16 10:16     ` Emanuele Giuseppe Esposito
2021-11-12 12:30   ` Hanna Reitz
2021-11-16 14:24     ` Emanuele Giuseppe Esposito
2021-11-16 15:07       ` Hanna Reitz
2021-10-25 10:17 ` [PATCH v4 05/25] block/block-backend.c: assertions for block-backend Emanuele Giuseppe Esposito
2021-11-12 11:01   ` Hanna Reitz
2021-11-16 10:15     ` Emanuele Giuseppe Esposito
2021-11-16 12:29       ` Hanna Reitz
2021-10-25 10:17 ` [PATCH v4 06/25] include/block/block_int: split header into I/O and global state API Emanuele Giuseppe Esposito
2021-11-12 12:17   ` Hanna Reitz
2021-11-16 10:24     ` Emanuele Giuseppe Esposito
2021-11-16 12:30       ` Hanna Reitz
2021-10-25 10:17 ` [PATCH v4 07/25] assertions for block_int " Emanuele Giuseppe Esposito
2021-11-12 13:51   ` Hanna Reitz
2021-11-16 15:43     ` Emanuele Giuseppe Esposito
2021-11-16 16:46       ` Hanna Reitz
2021-10-25 10:17 ` [PATCH v4 08/25] block: introduce assert_bdrv_graph_writable Emanuele Giuseppe Esposito
2021-11-12 14:40   ` Hanna Reitz
2021-11-18  9:55     ` Emanuele Giuseppe Esposito
2021-11-18 10:24       ` Emanuele Giuseppe Esposito
2021-11-18 15:17       ` Hanna Reitz
2021-11-19  8:55         ` Emanuele Giuseppe Esposito
2021-10-25 10:17 ` [PATCH v4 09/25] include/block/blockjob_int.h: split header into I/O and GS API Emanuele Giuseppe Esposito
2021-10-25 10:17 ` [PATCH v4 10/25] assertions for blockjob_int.h Emanuele Giuseppe Esposito
2021-11-12 15:17   ` Hanna Reitz
2021-10-25 10:17 ` [PATCH v4 11/25] include/block/blockjob.h: global state API Emanuele Giuseppe Esposito
2021-10-25 10:17 ` [PATCH v4 12/25] assertions for blockob.h " Emanuele Giuseppe Esposito
2021-11-12 15:26   ` Hanna Reitz
2021-10-25 10:17 ` [PATCH v4 13/25] include/sysemu/blockdev.h: move drive_add and inline drive_def Emanuele Giuseppe Esposito
2021-11-12 15:41   ` Hanna Reitz
2021-10-25 10:17 ` [PATCH v4 14/25] include/systemu/blockdev.h: global state API Emanuele Giuseppe Esposito
2021-10-28 15:48   ` Stefan Hajnoczi
2021-11-12 15:46   ` Hanna Reitz
2021-10-25 10:17 ` [PATCH v4 15/25] assertions for blockdev.h " Emanuele Giuseppe Esposito
2021-10-25 10:17 ` [PATCH v4 16/25] include/block/snapshot: global state API + assertions Emanuele Giuseppe Esposito
2021-10-25 10:17 ` [PATCH v4 17/25] block/copy-before-write.h: " Emanuele Giuseppe Esposito
2021-10-25 10:17 ` [PATCH v4 18/25] block/coroutines: I/O API Emanuele Giuseppe Esposito
2021-10-25 10:17 ` [PATCH v4 19/25] block_int-common.h: split function pointers in BlockDriver Emanuele Giuseppe Esposito
2021-11-15 12:00   ` Hanna Reitz
2021-11-18 12:42     ` Emanuele Giuseppe Esposito
2021-10-25 10:17 ` [PATCH v4 20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers Emanuele Giuseppe Esposito
2021-11-15 12:48   ` Hanna Reitz
2021-11-15 14:15     ` Hanna Reitz
2021-11-17 11:33     ` Emanuele Giuseppe Esposito
2021-11-17 12:51       ` Hanna Reitz
2021-11-17 13:09         ` Emanuele Giuseppe Esposito
2021-11-17 13:34           ` Hanna Reitz
2021-10-25 10:17 ` [PATCH v4 21/25] block_int-common.h: split function pointers in BdrvChildClass Emanuele Giuseppe Esposito
2021-11-15 14:36   ` Hanna Reitz
2021-10-25 10:17 ` [PATCH v4 22/25] block_int-common.h: assertions in the callers of BdrvChildClass function pointers Emanuele Giuseppe Esposito
2021-11-15 14:48   ` Hanna Reitz
2021-10-25 10:17 ` [PATCH v4 23/25] block-backend-common.h: split function pointers in BlockDevOps Emanuele Giuseppe Esposito
2021-10-25 14:10   ` Philippe Mathieu-Daudé
2021-10-25 10:17 ` [PATCH v4 24/25] job.h: split function pointers in JobDriver Emanuele Giuseppe Esposito
2021-11-15 15:11   ` Hanna Reitz
2021-11-17 13:43     ` Emanuele Giuseppe Esposito
2021-11-17 13:44       ` Hanna Reitz
2021-10-25 10:17 ` [PATCH v4 25/25] job.h: assertions in the callers of JobDriver funcion pointers Emanuele Giuseppe Esposito
2021-10-25 14:09 ` [PATCH v4 00/25] block layer: split block APIs in global state and I/O Philippe Mathieu-Daudé
2021-10-28 15:45   ` Stefan Hajnoczi
2021-10-28 15:49 ` Stefan Hajnoczi
2021-11-15 16:03 ` Hanna Reitz
2021-11-15 16:11   ` Daniel P. Berrangé
2021-11-18 13:50   ` Paolo Bonzini
2021-11-18 15:31     ` Hanna Reitz
2021-11-19  3:13       ` Paolo Bonzini
2021-11-19 10:42         ` Emanuele Giuseppe Esposito
2021-11-18 14:04   ` Paolo Bonzini
2021-11-18 15:22     ` Hanna Reitz
