[Qemu-devel] KVM Forum block no[td]es

From: Max Reitz @ 2018-11-11 22:25 UTC
  To: Qemu-block
  Cc: qemu-devel, Markus Armbruster, Alberto Garcia, Denis V. Lunev,
	Kevin Wolf, Vladimir Sementsov-Ogievskiy

This is what I’ve taken from two or three BoF-like get-togethers on
blocky things.  Amendments are more than welcome, of course.

Permission system
=================

GRAPH_MOD
---------

We need some way for the commit job to prevent graph changes on its
chain while it is running.  Our current blocker doesn’t do the job,
however.  What to do?

- We have no idea how to make a *permission* work.  Maybe the biggest
  problem is that it just doesn’t work as a permission, because the
  commit job doesn’t own the BdrvChildren that would need to be
  blocked (namely the @backing BdrvChild).

- A property of BdrvChild that can be set by a non-parent seems more
  feasible, e.g. a counter where changing the child is possible only
  if the counter is 0.  This also actually makes sense in what it
  means.
  (We never quite knew what “taking the GRAPH_MOD permission” or
  “unsharing the GRAPH_MOD permission” was supposed to mean.  Figuring
  that out always took like half an hour in any face-to-face meeting,
  and then we decided it was pretty much useless for any case we had
  at hand.)
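
A minimal sketch of the counter idea, not actual QEMU code: the
class, its methods and the node names are all made up for
illustration.  A non-parent (here, the commit job) bumps a counter on
the child, and replacing the child's node is refused while the counter
is non-zero.

```python
class Child:
    """Stand-in for a BdrvChild edge; every name here is hypothetical."""

    def __init__(self, name, node):
        self.name = name
        self.node = node
        self.frozen = 0  # number of outstanding "do not change" requests

    def freeze(self):
        self.frozen += 1

    def unfreeze(self):
        assert self.frozen > 0, "unbalanced unfreeze"
        self.frozen -= 1

    def replace_node(self, new_node):
        # Changing the child is possible only if the counter is 0.
        if self.frozen:
            raise RuntimeError(f"child '{self.name}' is frozen")
        self.node = new_node


backing = Child("backing", "base.qcow2")
backing.freeze()                      # commit job starts
try:
    backing.replace_node("other.qcow2")
    changed_while_frozen = True
except RuntimeError:
    changed_while_frozen = False
backing.unfreeze()                    # commit job finishes
backing.replace_node("other.qcow2")   # allowed again
```

A counter rather than a flag lets several users hold the same child
in place independently of each other.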

Reopen
------

How should permissions be handled while the reopen is under way?
Maybe we should take the union of @perm before and after, and the
intersection of @shared before and after?
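
As a rough illustration of that proposal, with permissions modelled as
plain bitmasks: the constant values and function names below are
assumptions for the example, not the real definitions.  The second
helper checks that moving from the intermediate state to the final one
only gives up bits in @perm and only adds bits to @shared.

```python
# Illustrative bit values only.
CONSISTENT_READ, WRITE, WRITE_UNCHANGED, RESIZE = 1, 2, 4, 8


def intermediate(old_perm, old_shared, new_perm, new_shared):
    """Permissions to hold while the reopen is under way."""
    return old_perm | new_perm, old_shared & new_shared


def is_relaxation(from_perm, from_shared, to_perm, to_shared):
    """True if the change takes nothing new and unshares nothing."""
    takes_more = to_perm & ~from_perm
    shares_less = from_shared & ~to_shared
    return takes_more == 0 and shares_less == 0


# Reopening a read-only node as read-write (hypothetical values).
old = (CONSISTENT_READ, CONSISTENT_READ | WRITE_UNCHANGED | WRITE)
new = (CONSISTENT_READ | WRITE, CONSISTENT_READ | WRITE_UNCHANGED)

mid = intermediate(*old, *new)
final_step_relaxes = is_relaxation(*mid, *new)
```

Because the intermediate @perm is a superset of the final @perm and
the intermediate @shared is a subset of the final @shared, the last
step is a relaxation for any pair of old and new values.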

- Taking permissions is a transaction that can fail.  Reopen, too, is
  a transaction, and we want to go from the intermediate to the final
  permissions in reopen’s commit part, so that transition is not
  allowed to fail.
  Since with the above model we would only relax things during that
  transition (relinquishing bits from @perm and adding bits to
  @shared), this transition should in theory be possible without any
  failure.  However, in practice things are different, as permission
  changes with file-posix nodes imply lock changes on the filesystem
  -- which may always fail.  Arguably failures from changing the
  file-posix locks can be ignored, because that just means that the
  file claims more permissions to be taken and less to be shared than
  is actually the case.  Which means you may not be able to open the
  file in some other application, while you should be, but that’s the
  benign kind of error.  You won’t be able to access data in a way
  you shouldn’t be able to.
  - Note that we have this issue already: dropping permissions
    sometimes aborts, because code assumes that dropping permissions
    is always safe and can never result in an error.  It seems best
    to ignore such protocol layer errors in the generic block layer
    rather than handling this in every protocol driver itself.
    (That is, the block layer should discard errors from dropping
    permissions on the protocol layer.)

- Is it possible that changing an option may require taking an
  intermediate permission that is required neither before nor after
  the reopen process?
  Changing a child link comes to mind (like changing a child from one
  BDS to another, where the visible data changes, which would mean we
  may want to e.g. unshare CONSISTENT_READ during the reopen).
  However:
  1. It is unfeasible to unshare that for all child changes.
     Effectively everything requires CONSISTENT_READ, and for good
     reason.
  2. Why would a user even change a BDS to something of a different
     content?
  3. Anything that currently allows you to change a child node assumes
     that the user always changes it to something of the same content
     (some take extra care to verify this, like mirror, which makes
     sure that @replaces and the target are connected, and there are
     only filter nodes in between).
  Always using the same enforcing model as mirror does (no. 3 above)
  does not really work, though, because one use case is to copy a
  backing file offline to some different storage and then replace the
  files via QMP.  To qemu, both files are completely unrelated.
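
Going back to the first point of this list (discarding protocol layer
errors when permissions are dropped), this is one way the generic
layer could behave.  It is only a sketch: set_permissions(), the state
dict and the update_locks callback are invented for the example, and
the callback merely stands in for whatever the protocol driver does to
its locks.

```python
def is_relaxation(old, new):
    """True if `new` takes no extra permissions and shares no fewer."""
    old_perm, old_shared = old
    new_perm, new_shared = new
    return (new_perm & ~old_perm) == 0 and (old_shared & ~new_shared) == 0


def set_permissions(state, new, update_locks):
    """Hypothetical generic-layer helper.

    update_locks(perm, shared) may raise OSError.
    """
    try:
        update_locks(*new)
    except OSError:
        if not is_relaxation(state["current"], new):
            raise  # taking more permissions must still be able to fail
        # Dropping failed: the file keeps stricter locks than needed,
        # which is the benign kind of error, so carry on.
        state["stale_locks"] = True
    state["current"] = new


def failing_update(perm, shared):
    raise OSError("lock update failed")


state = {"current": (0b11, 0b01), "stale_locks": False}
set_permissions(state, (0b01, 0b11), failing_update)  # relaxing: ignored
try:
    set_permissions(state, (0b11, 0b01), failing_update)  # tightening
    tightened = True
except OSError:
    tightened = False
```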

Block jobs, including blockdev-copy
===================================

Example for use of the fleecing filter:
- The real target is on slow storage.  Put an overlay on fast storage
  on top of it.  Then use that overlay as the target of the fleecing
  filter (and commit the data later or on the side), so that the
  backup job does not slow down the guest.

For a unified copy job, having a backup/fleecing filter is not an
obstacle.  One thing we definitely have to do, and can do, is to move
common functionality into a shared file so that the different jobs
can at least share that.

COR/Stream:
- There should be a way to discard ranges that have been copied into
  the overlay from the backing files to save space
- Also, the COR filter should be integrated with the stream job (at
  some point, as always)

Hole punching with active commit:
- Putting data into the target and punching holes in the overlays to
  make it visible on the active disk may be reasonable for some, but
  not for others -- it should be an option.  You want this if saving
  space is important, but you may not want this if speed is more
  important (depends on your backing chain length and other factors
  then, but that’s your choice).

- Another thing: If we don’t need to punch any holes because the
  intermediate layers aren’t allocated anyway, we don’t need to write
  the data into the active disk either.  This can probably be done
  indiscriminately, because the check for this does not concern the
  protocol layer but only qemu-controlled metadata, so it should be
  deterministically fast (want_zero=false).
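
One possible reading of the two points above, written down as a
per-write decision.  The function name, the action strings and the way
allocation is passed in are assumptions made for the example; the real
check would go through the block status interface.

```python
def plan_guest_write(overlay_allocated, punch_holes):
    """Decide what to do with one guest write during active commit.

    overlay_allocated: one bool per layer above the target (active
    disk last), True if that layer has the written range allocated.
    punch_holes: the user-selectable option discussed above.
    """
    actions = ["write to target"]
    if not any(overlay_allocated):
        # Nothing shadows the target, so the data is already visible
        # on the active disk.  This shortcut does not depend on the
        # option.
        return actions
    if punch_holes:
        actions += [f"punch hole in overlay {i}"
                    for i, allocated in enumerate(overlay_allocated)
                    if allocated]
    else:
        actions.append("write to active disk")
    return actions


unshadowed = plan_guest_write([False, False], punch_holes=False)
save_space = plan_guest_write([True, False], punch_holes=True)
save_time = plan_guest_write([True, False], punch_holes=False)
```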

qcow2
=====

Recovering corrupt images:
- Salvaging qemu-img convert would help (one that doesn’t abort
  everything on encountering a single I/O error)
- We may want to add an in-sync L1 table copy to recover from the
  worst kinds of corruptions.  Checksumming would be a good idea
  (then), too.
  - Should we update the checksum every time?  If it’s just the sum of
    all L1 entry values, why not?  The update is trivial then and only
    involves looking at the entries that were modified.
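
To show why the update is cheap, here is a sketch of such a sum-based
checksum.  Wrapping the sum at 64 bits is an assumption for the
example (nothing was decided about width or format), and the entry
values are made up.

```python
MASK64 = (1 << 64) - 1


def l1_checksum(entries):
    """Full checksum: sum of all L1 entry values, wrapping at 64 bits."""
    return sum(entries) & MASK64


def update_checksum(checksum, old_entry, new_entry):
    """Incremental update that only looks at the modified entry."""
    return (checksum - old_entry + new_entry) & MASK64


l1_table = [0x50000, 0x60000, 0]
checksum = l1_checksum(l1_table)

# Allocate a new L2 table for the third entry.
old_entry, l1_table[2] = l1_table[2], 0x70000
checksum = update_checksum(checksum, old_entry, l1_table[2])
```

The incremental result equals a full recomputation over the table,
which is what makes updating it on every L1 change affordable.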

Online check:
- This would need to be a block job
- The check function would probably need to be a proper coroutine
  (that does not just lock everything)
- Would be very complicated if you wanted it to work on R/W images.
  It’s probably best to focus on making this work for read-only
  images, because you can always just put a temporary snapshot over
  the image for the time of the test and then commit it down after the
  check is done.

Bitmaps
=======

(Got this section from sneaking into a BoF I wasn’t invited to.  Oh
well.  Won’t hurt to include them here.)

Currently, when dirty bitmaps are loaded, all IN_USE bitmaps are just
not loaded at all and completely ignored.  That isn’t correct, though:
they should either still be loaded (and automatically treated and
written back as fully dirty), or at least qemu-img check should
“repair” them (i.e. fully dirty them).
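
The first option could look roughly like this.  StoredBitmap and
load_bitmap() are hypothetical names, and a bitmap is reduced to a
list of booleans for the sake of the example.

```python
from dataclasses import dataclass


@dataclass
class StoredBitmap:
    name: str
    in_use: bool   # IN_USE flag as found in the image
    dirty: list    # one bool per granule


def load_bitmap(stored):
    """Return the in-memory dirty state for a stored bitmap."""
    if stored.in_use:
        # The contents cannot be trusted, so treat everything as dirty
        # instead of dropping the bitmap.
        return [True] * len(stored.dirty)
    return list(stored.dirty)


clean = load_bitmap(StoredBitmap("bitmap0", False, [True, False, False]))
stale = load_bitmap(StoredBitmap("bitmap1", True, [True, False, False]))
```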

Sometimes qemu (running in a mode as bare as possible) is better than
using qemu-img convert, for instance.  It gives you more control
(through QMP; you get e.g. better progress reporting), you get all of
the mirror optimizations (we do have optimizations for convert, too,
but whether it’s any good to write the same (or different?)
optimizations twice is another question), and you get a common
interface for everything (online and offline).
Note that besides a bare qemu we’ve also always wanted to convert as
many qemu-img operations into frontends for block jobs as possible.
We have only done this for commit, however, even though convert looked
like basically the ideal target.  It was just too hard with too little
apparent gain, like always (and convert supports additional features
like concatenation which we don’t have in the runtime block layer
yet).

Someone (not that someone™, but actually some specific someone) is
about to make qemu-img info display the list of persistent bitmaps.
Potential reviewers should be aware of the fact that this should be
done by adding that information to ImageInfoSpecificQCow2.

Transactionable bitmap primitives (e.g. copying a bitmap) would be
nice so you can use them when creating a snapshot.  Then it’d be up to
the management layer to make use of them:
- Do you want to continue using the very same bitmap?  Copy it then
  (or move it, depending on what exactly you want to do and what
  primitives there are)
- Do you want to start with a new bitmap?  Then just create a new one
  along with the overlay.

Misc topics
===========

SEEK_HOLE/SEEK_DATA:
- According to Denis, the bugs left in SEEK_HOLE and fiemap are the
  same now, but the former is slow when you seek over large ranges
  (because we just want to know whether a certain portion is allocated
  or not, but SEEK_HOLE/DATA actively seeks until the next hole/data
  region and queries all metadata on that path, regardless of whether
  we even care anymore)
  - Whether the bugs are the same depends on the version of Linux,
    however, and there is no clear way to determine for qemu whether
    fiemap is usable or not
  - Making it a configure option would leave it to the user or
    distribution, who should know for sure
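
For reference, this is roughly what the SEEK_DATA query looks like
from user space, here through Python's os.lseek() on a platform that
provides os.SEEK_DATA.  range_has_data() is a made-up helper.  Note
that the call takes no upper bound: the kernel searches for the next
data region no matter how far beyond the range of interest it is,
which is the slowness described above.

```python
import errno
import os
import tempfile


def range_has_data(fd, offset, length):
    """True if any byte in [offset, offset + length) is reported as data.

    Moves the file position of fd as a side effect.
    """
    try:
        data_start = os.lseek(fd, offset, os.SEEK_DATA)
    except OSError as e:
        if e.errno == errno.ENXIO:  # no data at or after offset
            return False
        raise
    return data_start < offset + length


with tempfile.TemporaryFile() as f:
    f.write(b"x" * 4096)
    f.flush()
    inside = range_has_data(f.fileno(), 0, 4096)
    past_eof = range_has_data(f.fileno(), 4096, 4096)
```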

Multiqueue with multiple iothreads:
- Kevin says Paolo says he’s working on it.  But there are some
  prerequisites left, the main one apparently being that there is one
  aio_poll() left that polls from the wrong context.  With that gone,
  we can also probably drop AIO context altogether.

Some things we want from a cache block driver:
- An optional maximum resident memory size; in this case, the driver
  needs to be backed by another node it uses for swapping
- Should support taking a bitmap from the cached node, from which it
  would then preload all dirty clusters
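
A toy model of those two wishes, with dicts standing in for nodes.
The Cache class, its method names and the LRU eviction policy are
assumptions for the example; no eviction policy was discussed.

```python
from collections import OrderedDict


class Cache:
    def __init__(self, source, swap, max_resident):
        self.source = source              # cached node: cluster -> data
        self.swap = swap                  # node used for swapping
        self.max_resident = max_resident  # limit, counted in clusters
        self.resident = OrderedDict()     # least recently used first

    def _insert(self, idx, data):
        self.resident[idx] = data
        self.resident.move_to_end(idx)
        while len(self.resident) > self.max_resident:
            old_idx, old_data = self.resident.popitem(last=False)
            self.swap[old_idx] = old_data  # evict to the swap node

    def read(self, idx):
        if idx in self.resident:
            self.resident.move_to_end(idx)
            return self.resident[idx]
        if idx in self.swap:
            data = self.swap.pop(idx)
        else:
            data = self.source[idx]
        self._insert(idx, data)
        return data

    def preload(self, dirty_bitmap):
        """Pull in every cluster the bitmap marks as dirty."""
        for idx, dirty in enumerate(dirty_bitmap):
            if dirty:
                self.read(idx)


source = {i: bytes([i]) * 4 for i in range(4)}
cache = Cache(source, swap={}, max_resident=2)
cache.preload([True, False, True, True])
```

With a limit of two clusters, preloading three dirty clusters leaves
the last two resident and pushes the first one out to the swap node.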
