All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] KVM Forum block no[td]es
@ 2018-11-11 22:25 Max Reitz
  2018-11-11 23:36 ` [Qemu-devel] [Qemu-block] " Nir Soffer
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Max Reitz @ 2018-11-11 22:25 UTC (permalink / raw)
  To: Qemu-block
  Cc: qemu-devel, Markus Armbruster, Alberto Garcia, Denis V. Lunev,
	Kevin Wolf, Vladimir Sementsov-Ogievskiy

[-- Attachment #1: Type: text/plain, Size: 10004 bytes --]

This is what I’ve taken from two or three BoF-like get-togethers on
blocky things.  Amendments are more than welcome, of course.



Permission system
=================

GRAPH_MOD
---------

We need some way for the commit job to prevent graph changes on its
chain while it is running.  Our current blocker doesn’t do the job,
however.  What to do?

- We have no idea how to make a *permission* work.  Maybe the biggest
  problem is that it just doesn’t work as a permission, because the
  commit job doesn’t own the BdrvChildren that would need to be
  blocked (namely the @backing BdrvChild).

- A property of BdrvChild that can be set by a non-parent seems more
  feasible, e.g. a counter where changing the child is possible only
  if the counter is 0.  This also actually makes sense in what it
  means.
  (We never quite knew what “taking the GRAPH_PERMISSION” or
  “unsharing the GRPAH_MOD permission” was supposed to mean.  Figuring
  that out always took like half an our in any face-to-face meeting,
  and then we decided it was pretty much useless for any case we had
  at hand.)


Reopen
------

How should permissions be handled while the reopen is under way?
Maybe we should take the union of @perm before and after, and the
intersection of @shared before and after?

- Taking permissions is a transaction that can fail.  Reopen, too, is
  a transaction, and we want to go from the intermediate to the final
  permissions in reopen’s commit part, so that transition is not
  allowed to fail.
  Since with the above model we would only relax things during that
  transition (relinquishing bits from @perm and adding bits to
  @shared), this transition should in theory be possible without any
  failure.  However, in practice things are different, as permission
  changes with file-posix nodes imply lock changes on the filesystem
  -- which may always fail.  Arguably failures from changing the
  file-posix locks can be ignored, because that just means that the
  file claims more permissions to be taken and less to be shared than
  is actually the case.  Which means you may not be able to open the
  file in some other application, while you should be, but that’s the
  benign kind of error.  You won’t be able to access data in a way
  you shouldn’t be able to.
  - Note that we have this issue already, so in general dropping
    permissions sometimes aborts because code assumes that dropping
    permissions is always safe and can never result in an error.  It
    seems best to ignore such protocol layer errors in the generic
    block layer rather than handling this in every protocol driver
    itself.
    (The block layer should discard errors from dropping permissions
    on the protocol layer.)

- Is it possible that changing an option may require taking an
  intermediate permission that is required neither before nor after
  the reopen process?
  Changing a child link comes to mind (like changing a child from one
  BDS to another, where the visible data changes, which would mean we
  may want to e.g. unshare CONSISTENT_READ during the reopen).
  However:
  1. It is unfeasible to unshare that for all child changes.
     Effectively everything requires CONSISTENT_READ, and for good
     reason.
  2. Why would a user even change a BDS to something of a different
     content?
  3. Anything that currently allows you to change a child node assumes
     that the user always changes it to something of the same content
     (some take extra care to verify this, like mirror, which makes
     sure that @replaces and the target are connected, and there are
     only filter nodes in between).
  Always using the same enforcing model as mirror does (no. 3 above)
  does not really work, though, because one use case is to copy a
  backing file offline to some different storage and then replace the
  files via QMP.  To qemu, both files are completely unrelated.


Block jobs, including blockdev-copy
===================================

Example for use of the fleecing filter:
- The real target is on slow storage.  Put an overlay on fast storage
  on top of it.  Then use that overlay as the target of the fleecing
  filter (and commit the data later or on the side), so that the
  backup job does not slow down the guest.

For a unified copy job, having a backup/fleecing filter is not a
problem on the way.  One thing we definitely have to and can do is to
copy common functionality into a shared file so that the different
jobs can at least share that.

COR/Stream:
- There should be a way to discard ranges that have been copied into
  the overlay from the backing files to save space
- Also, the COR filter should integrated with the stream job (at some
  point, as always)

Hole punching with active commit:
- Putting data into the target and punching holes in the overlays to
  make it visible on the active disk may be reasonable for some, but
  not for others -- it should be an option.  You want this if saving
  space is important, but you may not want this if speed is more
  important (depends on your backing chain length and other factors
  then, but that’s your choice).

- Another thing: If we don’t need to punch any holes because the
  intermediate layers aren’t allocated anyway, we don’t need to write
  the data into the active disk either.  This can probably be done
  indiscriminately, because the check for this does not concern the
  protocol layer but only qemu-controlled metadata, so it should be
  deterministically fast (want_zero=false).


qcow2
=====

Recovering corrupt images:
- Salvaging qemu-img convert would help (one that doesn’t abort
  everything on encountering a single I/O error)
- We may want to add an in-sync L1 table copy to recover from the
  worst kinds of corruptions.  Checksumming would be a good idea
  (then), too.
  - Should we update the checksum every time?  If it’s just the sum of
    all L1 entry values, why not, doing the update is trivial then and
    does not involve looking at any but the entries modified.

Online check:
- This would need to be a block job
- The check function would probably need to be a proper coroutine
  (that does not just lock everything)
- Would be very complicated if you wanted it to work on R/W images.
  It’s probably the best to focus on making this work for read-only
  images, because you can always just put a temporary snapshot over
  the image for the time of the test and then commit it down after the
  check is done.


Bitmaps
=======

(Got this section from sneaking into a BoF I wasn’t invited to.  Oh
well.  Won’t hurt to include them here.)

Currently, when dirty bitmaps are loaded, all IN_USE bitmaps are just
not loaded at all and completely ignored.  That isn’t correct, though,
they should either still be loaded (and automatically treated and
written back as fully dirty), or at least qemu-img check should
“repair” them (i.e. fully dirtying them).

Sometimes qemu (running in a mode as bare as possible) is better than
using qemu-img convert, for instance.  It gives you more control
(through QMP; you get e.g. better progress reporting), you get all of
the mirror optimizations (we do have optimizations for convert, too,
but whether it’s any good to write the same (or different?)
optimizations twice is another question), and you get a common
interface for everything (online and offline).
Note that besides a bare qemu we’ve also always wanted to convert as
many qemu-img operations into frontends for block jobs as possible.
We have only done this for commit, however, even though convert looked
like basically the ideal target.  It was just too hard with too little
apparent gain, like always (and convert supports additional features
like concatenation which we don’t have in the runtime block layer
yet).

Someone (not that someone™, but actually some specific someone) is
about to make qemu-img info display the list of persistent bitmaps.
Potential reviewers should be aware of the fact that this should be
done bye adding that information to ImageInfoSpecificQCow2.

Transacitonable bitmap primitives (e.g. copying a bitmap) would be
nice so you can use them when creating a snapshot.  Then it’d be up to
the management layer to make use of them:
- Do you want to continue using the very same bitmap?  Copy it then
  (or move it, depending on what exactly you want to do and what
  primitives there are)
- Do you want to start with a new bitmap?  Then just create a new one
  along with the overlay.


Misc topics
===========

SEEK_HOLE/SEEK_DATA:
- According to Denis, the bugs left in SEEK_HOLE and fiemap are the
  same now, but the former is slow when you seek over large ranges
  (because we just want to know whether a certain portion is allocated
  or not, but SEEK_HOLE/DATA actively seeks until the next hole/data
  region and queries all metadata on that path, regardless whether we
  even care anymore)
  - Whether the bugs are the same depends on the version of Linux,
    however, and there is no clear way to determine for qemu whether
    fiemap is usable or not
  - Making it a configure option would leave it to the user or
    distribution, who should know for sure

Multiqueue with multiple iothreads:
- Kevin says Paolo says he’s working on it.  But there are some
  prerequisites left, the main one apparently being that there is one
  aio_poll() left that polls from the wrong context.  With that gone,
  we can also probably drop AIO context altogether.

Some things we want from a cache block driver:
- An optional maximum resident memory size; in this case, the driver
  needs to be backed by another node it uses for swapping
- Should support taking a bitmap from the cached node, from which it
  would then preload all dirty clusters


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Qemu-block] KVM Forum block no[td]es
  2018-11-11 22:25 [Qemu-devel] KVM Forum block no[td]es Max Reitz
@ 2018-11-11 23:36 ` Nir Soffer
  2018-11-12 15:25   ` Max Reitz
  2018-11-13 15:12 ` [Qemu-devel] " Alberto Garcia
  2018-11-16  7:47 ` Denis V.Lunev
  2 siblings, 1 reply; 22+ messages in thread
From: Nir Soffer @ 2018-11-11 23:36 UTC (permalink / raw)
  To: Max Reitz
  Cc: qemu-block, Kevin Wolf, vsementsov, QEMU Developers,
	Markus Armbruster, den

On Mon, Nov 12, 2018 at 12:25 AM Max Reitz <mreitz@redhat.com> wrote:

> This is what I’ve taken from two or three BoF-like get-togethers on
> blocky things.  Amendments are more than welcome, of course.

...

> Bitmaps
>
=======
>
> (Got this section from sneaking into a BoF I wasn’t invited to.  Oh
> well.  Won’t hurt to include them here.)
>
> Currently, when dirty bitmaps are loaded, all IN_USE bitmaps are just
> not loaded at all and completely ignored.  That isn’t correct, though,
> they should either still be loaded (and automatically treated and
> written back as fully dirty), or at least qemu-img check should
> “repair” them (i.e. fully dirtying them).
>

I'm not sure making bitmaps dirty is better.

When bitmap is marked IN_USE, it means that we cannot use it for
backup. Deleting the bitmap or making it as bad so it cannot be used
for the next backup is the same. Making the bitmap as dirty will full
the management layer that everything was fine when the next backup
includes the entire disk. It is better to cause the next backup to fail
in a verbose way, since the backup software can recover by doing
a full backup.

Sometimes qemu (running in a mode as bare as possible) is better than
> using qemu-img convert, for instance.  It gives you more control
> (through QMP; you get e.g. better progress reporting), you get all of
> the mirror optimizations (we do have optimizations for convert, too,
> but whether it’s any good to write the same (or different?)
> optimizations twice is another question), and you get a common
> interface for everything (online and offline).
> Note that besides a bare qemu we’ve also always wanted to convert as
> many qemu-img operations into frontends for block jobs as possible.
> We have only done this for commit, however, even though convert looked
> like basically the ideal target.  It was just too hard with too little
> apparent gain, like always (and convert supports additional features
> like concatenation which we don’t have in the runtime block layer
> yet).
>

I'm not sure it is better to run qemu and use qemu-img as thin wrapper
over qmp.

For example, management system may ascociate qemu-img
with a sanlock lease, and sanlock may try to terminate qemu-img when
a lease is invalidated. In this case sanlock will succeed while qemu is till
accessing storage it should not access.
...

> Transacitonable bitmap primitives (e.g. copying a bitmap) would be
> nice so you can use them when creating a snapshot.  Then it’d be up to
> the management layer to make use of them:
> - Do you want to continue using the very same bitmap?  Copy it then
>   (or move it, depending on what exactly you want to do and what
>   primitives there are)
> - Do you want to start with a new bitmap?  Then just create a new one
>   along with the overlay.
>

Having both options sounds good, but we can start with the first.

Nir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Qemu-block] KVM Forum block no[td]es
  2018-11-11 23:36 ` [Qemu-devel] [Qemu-block] " Nir Soffer
@ 2018-11-12 15:25   ` Max Reitz
  2018-11-12 16:10     ` Nir Soffer
  2018-11-14 19:38     ` John Snow
  0 siblings, 2 replies; 22+ messages in thread
From: Max Reitz @ 2018-11-12 15:25 UTC (permalink / raw)
  To: Nir Soffer
  Cc: qemu-block, Kevin Wolf, vsementsov, QEMU Developers,
	Markus Armbruster, den

[-- Attachment #1: Type: text/plain, Size: 3084 bytes --]

On 12.11.18 00:36, Nir Soffer wrote:
> On Mon, Nov 12, 2018 at 12:25 AM Max Reitz <mreitz@redhat.com
> <mailto:mreitz@redhat.com>> wrote:
> 
>     This is what I’ve taken from two or three BoF-like get-togethers on
>     blocky things.  Amendments are more than welcome, of course.
> 
> ... 
> 
>     Bitmaps
> 
>     =======
> 
>     (Got this section from sneaking into a BoF I wasn’t invited to.  Oh
>     well.  Won’t hurt to include them here.)
> 
>     Currently, when dirty bitmaps are loaded, all IN_USE bitmaps are just
>     not loaded at all and completely ignored.  That isn’t correct, though,
>     they should either still be loaded (and automatically treated and
>     written back as fully dirty), or at least qemu-img check should
>     “repair” them (i.e. fully dirtying them).
> 
> 
> I'm not sure making bitmaps dirty is better.
> 
> When bitmap is marked IN_USE, it means that we cannot use it for
> backup. Deleting the bitmap or making it as bad so it cannot be used
> for the next backup is the same. Making the bitmap as dirty will full
> the management layer that everything was fine when the next backup
> includes the entire disk. It is better to cause the next backup to fail
> in a verbose way, since the backup software can recover by doing
> a full backup.

But making the dirty bitmap fully dirty will automatically enforce a
full backup.

>     Sometimes qemu (running in a mode as bare as possible) is better than
>     using qemu-img convert, for instance.  It gives you more control
>     (through QMP; you get e.g. better progress reporting), you get all of
>     the mirror optimizations (we do have optimizations for convert, too,
>     but whether it’s any good to write the same (or different?)
>     optimizations twice is another question), and you get a common
>     interface for everything (online and offline).
>     Note that besides a bare qemu we’ve also always wanted to convert as
>     many qemu-img operations into frontends for block jobs as possible.
>     We have only done this for commit, however, even though convert looked
>     like basically the ideal target.  It was just too hard with too little
>     apparent gain, like always (and convert supports additional features
>     like concatenation which we don’t have in the runtime block layer
>     yet).
> 
> 
> I'm not sure it is better to run qemu and use qemu-img as thin wrapper
> over qmp.
> 
> For example, management system may ascociate qemu-img
> with a sanlock lease, and sanlock may try to terminate qemu-img when
> a lease is invalidated. In this case sanlock will succeed while qemu is till
> accessing storage it should not access.
> ...

The point was not to run two processes, but to link the necessary bits
of qemu into qemu-img and then use them inside the single qemu-img
process itself.  As hinted at above, we've been doing this for qemu-img
commit for quite some time; that command actually runs a block job under
the hood (inside of qemu-img).

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Qemu-block] KVM Forum block no[td]es
  2018-11-12 15:25   ` Max Reitz
@ 2018-11-12 16:10     ` Nir Soffer
  2018-11-21  1:24       ` Vladimir Sementsov-Ogievskiy
  2018-11-14 19:38     ` John Snow
  1 sibling, 1 reply; 22+ messages in thread
From: Nir Soffer @ 2018-11-12 16:10 UTC (permalink / raw)
  To: Max Reitz
  Cc: qemu-block, Kevin Wolf, vsementsov, QEMU Developers,
	Markus Armbruster, den

On Mon, Nov 12, 2018 at 5:26 PM Max Reitz <mreitz@redhat.com> wrote:

> On 12.11.18 00:36, Nir Soffer wrote:
> > On Mon, Nov 12, 2018 at 12:25 AM Max Reitz <mreitz@redhat.com
> > <mailto:mreitz@redhat.com>> wrote:
> >
> >     This is what I’ve taken from two or three BoF-like get-togethers on
> >     blocky things.  Amendments are more than welcome, of course.
> >
> > ...
> >
> >     Bitmaps
> >
> >     =======
> >
> >     (Got this section from sneaking into a BoF I wasn’t invited to.  Oh
> >     well.  Won’t hurt to include them here.)
> >
> >     Currently, when dirty bitmaps are loaded, all IN_USE bitmaps are just
> >     not loaded at all and completely ignored.  That isn’t correct,
> though,
> >     they should either still be loaded (and automatically treated and
> >     written back as fully dirty), or at least qemu-img check should
> >     “repair” them (i.e. fully dirtying them).
> >
> >
> > I'm not sure making bitmaps dirty is better.
> >
> > When bitmap is marked IN_USE, it means that we cannot use it for
> > backup. Deleting the bitmap or making it as bad so it cannot be used
> > for the next backup is the same. Making the bitmap as dirty will full
> > the management layer that everything was fine when the next backup
> > includes the entire disk. It is better to cause the next backup to fail
> > in a verbose way, since the backup software can recover by doing
> > a full backup.
>
> But making the dirty bitmap fully dirty will automatically enforce a
> full backup.
>

True, but is also hide the fact that we lost the bitmap. I think we should
start with the simple way of making it fail and let the rest of the stack
fallback to full backup.

>     Sometimes qemu (running in a mode as bare as possible) is better than
> >     using qemu-img convert, for instance.  It gives you more control
> >     (through QMP; you get e.g. better progress reporting), you get all of
> >     the mirror optimizations (we do have optimizations for convert, too,
> >     but whether it’s any good to write the same (or different?)
> >     optimizations twice is another question), and you get a common
> >     interface for everything (online and offline).
> >     Note that besides a bare qemu we’ve also always wanted to convert as
> >     many qemu-img operations into frontends for block jobs as possible.
> >     We have only done this for commit, however, even though convert
> looked
> >     like basically the ideal target.  It was just too hard with too
> little
> >     apparent gain, like always (and convert supports additional features
> >     like concatenation which we don’t have in the runtime block layer
> >     yet).
> >
> >
> > I'm not sure it is better to run qemu and use qemu-img as thin wrapper
> > over qmp.
> >
> > For example, management system may ascociate qemu-img
> > with a sanlock lease, and sanlock may try to terminate qemu-img when
> > a lease is invalidated. In this case sanlock will succeed while qemu is
> till
> > accessing storage it should not access.
> > ...
>
> The point was not to run two processes, but to link the necessary bits
> of qemu into qemu-img and then use them inside the single qemu-img
> process itself.  As hinted at above, we've been doing this for qemu-img
> commit for quite some time; that command actually runs a block job under
> the hood (inside of qemu-img).
>

Sounds great.

Nir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-11 22:25 [Qemu-devel] KVM Forum block no[td]es Max Reitz
  2018-11-11 23:36 ` [Qemu-devel] [Qemu-block] " Nir Soffer
@ 2018-11-13 15:12 ` Alberto Garcia
  2018-11-14 17:24   ` Max Reitz
  2018-11-16  7:47 ` Denis V.Lunev
  2 siblings, 1 reply; 22+ messages in thread
From: Alberto Garcia @ 2018-11-13 15:12 UTC (permalink / raw)
  To: Max Reitz, Qemu-block
  Cc: qemu-devel, Markus Armbruster, Denis V. Lunev, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy

On Sun 11 Nov 2018 11:25:00 PM CET, Max Reitz wrote:

> Permission system
> =================
>
> GRAPH_MOD
> ---------
>
> We need some way for the commit job to prevent graph changes on its
> chain while it is running.  Our current blocker doesn’t do the job,
> however.  What to do?
>
> - We have no idea how to make a *permission* work.  Maybe the biggest
>   problem is that it just doesn’t work as a permission, because the
>   commit job doesn’t own the BdrvChildren that would need to be
>   blocked (namely the @backing BdrvChild).

What I'm doing at the moment in my blockdev-reopen series is check
whether all parents of the node I want to change share the GRAPH_MOD
permission. If any of them doesn't then the operation fails.

This can be checked by calling bdrv_get_cumulative_perm() or simply
bdrv_check_update_perm(..., BLK_PERM_GRAPH_MOD, ...).

It solves the problem because e.g. the stream block job only allows
BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED, so no graph
modifications allowed.

>   (We never quite knew what “taking the GRAPH_PERMISSION” or
>   “unsharing the GRPAH_MOD permission” was supposed to mean.  Figuring
>   that out always took like half an our in any face-to-face meeting,
>   and then we decided it was pretty much useless for any case we had
>   at hand.)

Yeah it's a bit unclear. It can mean "none of this node's children can
be replaced / removed", which is what I'm using it for at the moment.

> Reopen
> ------
>
> How should permissions be handled while the reopen is under way?
> Maybe we should take the union of @perm before and after, and the
> intersection of @shared before and after?

Do you have an example of a case in which you're reopening a node and
the change of permissions causes a problem?

> - Is it possible that changing an option may require taking an
>   intermediate permission that is required neither before nor after
>   the reopen process?
>   Changing a child link comes to mind (like changing a child from one
>   BDS to another, where the visible data changes, which would mean we
>   may want to e.g. unshare CONSISTENT_READ during the reopen).
>   However:
>   1. It is unfeasible to unshare that for all child changes.
>      Effectively everything requires CONSISTENT_READ, and for good
>      reason.
>   2. Why would a user even change a BDS to something of a different
>      content?
>   3. Anything that currently allows you to change a child node assumes
>      that the user always changes it to something of the same content
>      (some take extra care to verify this, like mirror, which makes
>      sure that @replaces and the target are connected, and there are
>      only filter nodes in between).

I don't think the user wants to change a BDS to something of a different
content, but it may happen that QEMU doesn't have a way to verify
whether the content is the same or not.

I think we have one such case already with 'blockdev-snapshot', in which
you add a BDS with blockdev-add and then add its contents on top of an
existing BDS.

Berto

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-13 15:12 ` [Qemu-devel] " Alberto Garcia
@ 2018-11-14 17:24   ` Max Reitz
  2018-11-15 14:28     ` Alberto Garcia
  0 siblings, 1 reply; 22+ messages in thread
From: Max Reitz @ 2018-11-14 17:24 UTC (permalink / raw)
  To: Alberto Garcia, Qemu-block
  Cc: qemu-devel, Markus Armbruster, Denis V. Lunev, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy

[-- Attachment #1: Type: text/plain, Size: 4735 bytes --]

On 13.11.18 16:12, Alberto Garcia wrote:
> On Sun 11 Nov 2018 11:25:00 PM CET, Max Reitz wrote:
> 
>> Permission system
>> =================
>>
>> GRAPH_MOD
>> ---------
>>
>> We need some way for the commit job to prevent graph changes on its
>> chain while it is running.  Our current blocker doesn’t do the job,
>> however.  What to do?
>>
>> - We have no idea how to make a *permission* work.  Maybe the biggest
>>   problem is that it just doesn’t work as a permission, because the
>>   commit job doesn’t own the BdrvChildren that would need to be
>>   blocked (namely the @backing BdrvChild).
> 
> What I'm doing at the moment in my blockdev-reopen series is check
> whether all parents of the node I want to change share the GRAPH_MOD
> permission. If any of them doesn't then the operation fails.
> 
> This can be checked by calling bdrv_get_cumulative_perm() or simply
> bdrv_check_update_perm(..., BLK_PERM_GRAPH_MOD, ...).

Sure.

> It solves the problem because e.g. the stream block job only allows
> BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED, so no graph
> modifications allowed.

But the problem is that the commit job needs to unshare the permission
on all backing links between base and top, so none of the backing links
can be broken.  But it cannot do that because those backing links do not
belong to it (but to the respective overlay nodes).

>>   (We never quite knew what “taking the GRAPH_PERMISSION” or
>>   “unsharing the GRPAH_MOD permission” was supposed to mean.  Figuring
>>   that out always took like half an our in any face-to-face meeting,
>>   and then we decided it was pretty much useless for any case we had
>>   at hand.)
> 
> Yeah it's a bit unclear. It can mean "none of this node's children can
> be replaced / removed", which is what I'm using it for at the moment.

Yes, but that's just not useful.  I don't think we have any case where
something would be annoyed by having its immediate child replaced (other
than the visible data changing, potentially).  Usually when we say
something (e.g. a block job) wants to prevent graph modification, it's
about changing edges that exist independently of the job (such as a
backing chain).

>> Reopen
>> ------
>>
>> How should permissions be handled while the reopen is under way?
>> Maybe we should take the union of @perm before and after, and the
>> intersection of @shared before and after?
> 
> Do you have an example of a case in which you're reopening a node and
> the change of permissions causes a problem?

I do not.  So currently we switch from the permissions before to the
ones after, right?  Which should be safe because that switch is a
transaction that is integrated into reopen.

However, I suppose it's possible the protocol layer gives us some issues
again.  It cannot switch the locks before commit, but committing may
fail.  What to do then?

It would be safer if we took the union/intersection in
bdrv_reopen_prepare() and then released the unnecessary flags in
bdrv_reopen_commit().  We can still get errors (as discussed), but those
can be safely ignored.  (They're just annoying, but not harmful.)

>> - Is it possible that changing an option may require taking an
>>   intermediate permission that is required neither before nor after
>>   the reopen process?
>>   Changing a child link comes to mind (like changing a child from one
>>   BDS to another, where the visible data changes, which would mean we
>>   may want to e.g. unshare CONSISTENT_READ during the reopen).
>>   However:
>>   1. It is unfeasible to unshare that for all child changes.
>>      Effectively everything requires CONSISTENT_READ, and for good
>>      reason.
>>   2. Why would a user even change a BDS to something of a different
>>      content?
>>   3. Anything that currently allows you to change a child node assumes
>>      that the user always changes it to something of the same content
>>      (some take extra care to verify this, like mirror, which makes
>>      sure that @replaces and the target are connected, and there are
>>      only filter nodes in between).
> 
> I don't think the user wants to change a BDS to something of a different
> content, but it may happen that QEMU doesn't have a way to verify
> whether the content is the same or not.
> 
> I think we have one such case already with 'blockdev-snapshot', in which
> you add a BDS with blockdev-add and then add its contents on top of an
> existing BDS.

Yeah.  The new overlay is expected to be empty so when you replace the
now-snapshot with it, the parent nodes don't see any change.  If it's
not empty...  Well, your problem.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Qemu-block] KVM Forum block no[td]es
  2018-11-12 15:25   ` Max Reitz
  2018-11-12 16:10     ` Nir Soffer
@ 2018-11-14 19:38     ` John Snow
  1 sibling, 0 replies; 22+ messages in thread
From: John Snow @ 2018-11-14 19:38 UTC (permalink / raw)
  To: Max Reitz, Nir Soffer
  Cc: Kevin Wolf, vsementsov, qemu-block, QEMU Developers,
	Markus Armbruster, den



On 11/12/18 10:25 AM, Max Reitz wrote:
> On 12.11.18 00:36, Nir Soffer wrote:
>> On Mon, Nov 12, 2018 at 12:25 AM Max Reitz <mreitz@redhat.com
>> <mailto:mreitz@redhat.com>> wrote:
>>
>>     This is what I’ve taken from two or three BoF-like get-togethers on
>>     blocky things.  Amendments are more than welcome, of course.
>>
>> ... 
>>
>>     Bitmaps
>>
>>     =======
>>
>>     (Got this section from sneaking into a BoF I wasn’t invited to.  Oh
>>     well.  Won’t hurt to include them here.)
>>
>>     Currently, when dirty bitmaps are loaded, all IN_USE bitmaps are just
>>     not loaded at all and completely ignored.  That isn’t correct, though,
>>     they should either still be loaded (and automatically treated and
>>     written back as fully dirty), or at least qemu-img check should
>>     “repair” them (i.e. fully dirtying them).
>>
>>
>> I'm not sure making bitmaps dirty is better.
>>
>> When bitmap is marked IN_USE, it means that we cannot use it for
>> backup. Deleting the bitmap or making it as bad so it cannot be used
>> for the next backup is the same. Making the bitmap as dirty will full
>> the management layer that everything was fine when the next backup
>> includes the entire disk. It is better to cause the next backup to fail
>> in a verbose way, since the backup software can recover by doing
>> a full backup.
> 
> But making the dirty bitmap fully dirty will automatically enforce a
> full backup.
> 

I lean towards just deleting the bitmap in these cases but *not*
automatically.

If you do check -r, though, deleting them seems correct. This way you
don't get any nasty surprises when the cronjob goes to make an
incremental and it's accidentally 2TB and still running when you arrive
on Monday morning.

>>     Sometimes qemu (running in a mode as bare as possible) is better than
>>     using qemu-img convert, for instance.  It gives you more control
>>     (through QMP; you get e.g. better progress reporting), you get all of
>>     the mirror optimizations (we do have optimizations for convert, too,
>>     but whether it’s any good to write the same (or different?)
>>     optimizations twice is another question), and you get a common
>>     interface for everything (online and offline).
>>     Note that besides a bare qemu we’ve also always wanted to convert as
>>     many qemu-img operations into frontends for block jobs as possible.
>>     We have only done this for commit, however, even though convert looked
>>     like basically the ideal target.  It was just too hard with too little
>>     apparent gain, like always (and convert supports additional features
>>     like concatenation which we don’t have in the runtime block layer
>>     yet).
>>
>>
>> I'm not sure it is better to run qemu and use qemu-img as thin wrapper
>> over qmp.
>>
>> For example, management system may ascociate qemu-img
>> with a sanlock lease, and sanlock may try to terminate qemu-img when
>> a lease is invalidated. In this case sanlock will succeed while qemu is till
>> accessing storage it should not access.
>> ...
> 
> The point was not to run two processes, but to link the necessary bits
> of qemu into qemu-img and then use them inside the single qemu-img
> process itself.  As hinted at above, we've been doing this for qemu-img
> commit for quite some time; that command actually runs a block job under
> the hood (inside of qemu-img).
> 
> Max
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-14 17:24   ` Max Reitz
@ 2018-11-15 14:28     ` Alberto Garcia
  2018-11-16 12:14       ` Max Reitz
  0 siblings, 1 reply; 22+ messages in thread
From: Alberto Garcia @ 2018-11-15 14:28 UTC (permalink / raw)
  To: Max Reitz, Qemu-block
  Cc: qemu-devel, Markus Armbruster, Denis V. Lunev, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy

On Wed 14 Nov 2018 06:24:10 PM CET, Max Reitz wrote:
>>> Permission system
>>> =================
>>>
>>> GRAPH_MOD
>>> ---------
>>>
>>> We need some way for the commit job to prevent graph changes on its
>>> chain while it is running.  Our current blocker doesn’t do the job,
>>> however.  What to do?
>>>
>>> - We have no idea how to make a *permission* work.  Maybe the biggest
>>>   problem is that it just doesn’t work as a permission, because the
>>>   commit job doesn’t own the BdrvChildren that would need to be
>>>   blocked (namely the @backing BdrvChild).
>> 
>> What I'm doing at the moment in my blockdev-reopen series is check
>> whether all parents of the node I want to change share the GRAPH_MOD
>> permission. If any of them doesn't then the operation fails.
>> 
>> This can be checked by calling bdrv_get_cumulative_perm() or simply
>> bdrv_check_update_perm(..., BLK_PERM_GRAPH_MOD, ...).
>
> Sure.
>
>> It solves the problem because e.g. the stream block job only allows
>> BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED, so no graph
>> modifications allowed.
>
> But the problem is that the commit job needs to unshare the permission
> on all backing links between base and top, so none of the backing
> links can be broken.  But it cannot do that because those backing
> links do not belong to it (but to the respective overlay nodes).

I'm not sure if I'm following you. The commit block job has references
to all intermediate nodes (with block_job_add_bdrv()) and only shares
BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE. Anyone trying to obtain
BLK_PERM_GRAPH_MOD on an intermediate node would see that the block job
doesn't allow it.

>>>   (We never quite knew what “taking the GRAPH_PERMISSION” or
>>>   “unsharing the GRPAH_MOD permission” was supposed to mean.
>>>   Figuring that out always took like half an our in any face-to-face
>>>   meeting, and then we decided it was pretty much useless for any
>>>   case we had at hand.)
>> 
>> Yeah it's a bit unclear. It can mean "none of this node's children
>> can be replaced / removed", which is what I'm using it for at the
>> moment.
>
> Yes, but that's just not useful.  I don't think we have any case where
> something would be annoyed by having its immediate child replaced
> (other than the visible data changing, potentially).  Usually when we
> say something (e.g. a block job) wants to prevent graph modification,
> it's about changing edges that exist independently of the job (such as
> a backing chain).

Isn't that what I just said? You cannot change other edges <=> you
cannot replace a node's children (or parents).

>>> Reopen
>>> ------
>>>
>>> How should permissions be handled while the reopen is under way?
>>> Maybe we should take the union of @perm before and after, and the
>>> intersection of @shared before and after?
>> 
>> Do you have an example of a case in which you're reopening a node and
>> the change of permissions causes a problem?
>
> I do not.  So currently we switch from the permissions before to the
> ones after, right?  Which should be safe because that switch is a
> transaction that is integrated into reopen.
>
> However, I suppose it's possible the protocol layer gives us some
> issues again.  It cannot switch the locks before commit, but
> committing may fail.  What to do then?
>
> It would be safer if we took the union/intersection in
> bdrv_reopen_prepare() and then released the unnecessary flags in
> bdrv_reopen_commit().  We can still get errors (as discussed), but
> those can be safely ignored.  (They're just annoying, but not
> harmful.)

Ok, I suppose you're right.

Berto

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-11 22:25 [Qemu-devel] KVM Forum block no[td]es Max Reitz
  2018-11-11 23:36 ` [Qemu-devel] [Qemu-block] " Nir Soffer
  2018-11-13 15:12 ` [Qemu-devel] " Alberto Garcia
@ 2018-11-16  7:47 ` Denis V.Lunev
  2 siblings, 0 replies; 22+ messages in thread
From: Denis V.Lunev @ 2018-11-16  7:47 UTC (permalink / raw)
  To: Max Reitz, Qemu-block
  Cc: qemu-devel, Markus Armbruster, Alberto Garcia, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy

On 11/12/18 1:25 AM, Max Reitz wrote:
> This is what I’ve taken from two or three BoF-like get-togethers on
> blocky things.  Amendments are more than welcome, of course.
>
>
>
> Permission system
> =================
>
> GRAPH_MOD
> ---------
>
> We need some way for the commit job to prevent graph changes on its
> chain while it is running.  Our current blocker doesn’t do the job,
> however.  What to do?
>
> - We have no idea how to make a *permission* work.  Maybe the biggest
>   problem is that it just doesn’t work as a permission, because the
>   commit job doesn’t own the BdrvChildren that would need to be
>   blocked (namely the @backing BdrvChild).
>
> - A property of BdrvChild that can be set by a non-parent seems more
>   feasible, e.g. a counter where changing the child is possible only
>   if the counter is 0.  This also actually makes sense in what it
>   means.
>   (We never quite knew what “taking the GRAPH_PERMISSION” or
>   “unsharing the GRPAH_MOD permission” was supposed to mean.  Figuring
>   that out always took like half an our in any face-to-face meeting,
>   and then we decided it was pretty much useless for any case we had
>   at hand.)
>
>
> Reopen
> ------
>
> How should permissions be handled while the reopen is under way?
> Maybe we should take the union of @perm before and after, and the
> intersection of @shared before and after?
>
> - Taking permissions is a transaction that can fail.  Reopen, too, is
>   a transaction, and we want to go from the intermediate to the final
>   permissions in reopen’s commit part, so that transition is not
>   allowed to fail.
>   Since with the above model we would only relax things during that
>   transition (relinquishing bits from @perm and adding bits to
>   @shared), this transition should in theory be possible without any
>   failure.  However, in practice things are different, as permission
>   changes with file-posix nodes imply lock changes on the filesystem
>   -- which may always fail.  Arguably failures from changing the
>   file-posix locks can be ignored, because that just means that the
>   file claims more permissions to be taken and less to be shared than
>   is actually the case.  Which means you may not be able to open the
>   file in some other application, while you should be, but that’s the
>   benign kind of error.  You won’t be able to access data in a way
>   you shouldn’t be able to.
>   - Note that we have this issue already, so in general dropping
>     permissions sometimes aborts because code assumes that dropping
>     permissions is always safe and can never result in an error.  It
>     seems best to ignore such protocol layer errors in the generic
>     block layer rather than handling this in every protocol driver
>     itself.
>     (The block layer should discard errors from dropping permissions
>     on the protocol layer.)
>
> - Is it possible that changing an option may require taking an
>   intermediate permission that is required neither before nor after
>   the reopen process?
>   Changing a child link comes to mind (like changing a child from one
>   BDS to another, where the visible data changes, which would mean we
>   may want to e.g. unshare CONSISTENT_READ during the reopen).
>   However:
>   1. It is unfeasible to unshare that for all child changes.
>      Effectively everything requires CONSISTENT_READ, and for good
>      reason.
>   2. Why would a user even change a BDS to something of a different
>      content?
>   3. Anything that currently allows you to change a child node assumes
>      that the user always changes it to something of the same content
>      (some take extra care to verify this, like mirror, which makes
>      sure that @replaces and the target are connected, and there are
>      only filter nodes in between).
>   Always using the same enforcing model as mirror does (no. 3 above)
>   does not really work, though, because one use case is to copy a
>   backing file offline to some different storage and then replace the
>   files via QMP.  To qemu, both files are completely unrelated.
>
>
> Block jobs, including blockdev-copy
> ===================================
>
> Example for use of the fleecing filter:
> - The real target is on slow storage.  Put an overlay on fast storage
>   on top of it.  Then use that overlay as the target of the fleecing
>   filter (and commit the data later or on the side), so that the
>   backup job does not slow down the guest.
>
> For a unified copy job, having a backup/fleecing filter is not a
> problem on the way.  One thing we definitely have to and can do is to
> copy common functionality into a shared file so that the different
> jobs can at least share that.
>
> COR/Stream:
> - There should be a way to discard ranges that have been copied into
>   the overlay from the backing files to save space
> - Also, the COR filter should integrated with the stream job (at some
>   point, as always)
>
> Hole punching with active commit:
> - Putting data into the target and punching holes in the overlays to
>   make it visible on the active disk may be reasonable for some, but
>   not for others -- it should be an option.  You want this if saving
>   space is important, but you may not want this if speed is more
>   important (depends on your backing chain length and other factors
>   then, but that’s your choice).
>
> - Another thing: If we don’t need to punch any holes because the
>   intermediate layers aren’t allocated anyway, we don’t need to write
>   the data into the active disk either.  This can probably be done
>   indiscriminately, because the check for this does not concern the
>   protocol layer but only qemu-controlled metadata, so it should be
>   deterministically fast (want_zero=false).
>
>
> qcow2
> =====
>
> Recovering corrupt images:
> - Salvaging qemu-img convert would help (one that doesn’t abort
>   everything on encountering a single I/O error)
> - We may want to add an in-sync L1 table copy to recover from the
>   worst kinds of corruptions.  Checksumming would be a good idea
>   (then), too.
>   - Should we update the checksum every time?  If it’s just the sum of
>     all L1 entry values, why not, doing the update is trivial then and
>     does not involve looking at any but the entries modified.
>
> Online check:
> - This would need to be a block job
> - The check function would probably need to be a proper coroutine
>   (that does not just lock everything)
> - Would be very complicated if you wanted it to work on R/W images.
>   It’s probably the best to focus on making this work for read-only
>   images, because you can always just put a temporary snapshot over
>   the image for the time of the test and then commit it down after the
>   check is done.
>
>
> Bitmaps
> =======
>
> (Got this section from sneaking into a BoF I wasn’t invited to.  Oh
> well.  Won’t hurt to include them here.)
>
> Currently, when dirty bitmaps are loaded, all IN_USE bitmaps are just
> not loaded at all and completely ignored.  That isn’t correct, though,
> they should either still be loaded (and automatically treated and
> written back as fully dirty), or at least qemu-img check should
> “repair” them (i.e. fully dirtying them).
>
> Sometimes qemu (running in a mode as bare as possible) is better than
> using qemu-img convert, for instance.  It gives you more control
> (through QMP; you get e.g. better progress reporting), you get all of
> the mirror optimizations (we do have optimizations for convert, too,
> but whether it’s any good to write the same (or different?)
> optimizations twice is another question), and you get a common
> interface for everything (online and offline).
> Note that besides a bare qemu we’ve also always wanted to convert as
> many qemu-img operations into frontends for block jobs as possible.
> We have only done this for commit, however, even though convert looked
> like basically the ideal target.  It was just too hard with too little
> apparent gain, like always (and convert supports additional features
> like concatenation which we don’t have in the runtime block layer
> yet).
>
> Someone (not that someone™, but actually some specific someone) is
> about to make qemu-img info display the list of persistent bitmaps.
> Potential reviewers should be aware of the fact that this should be
> done bye adding that information to ImageInfoSpecificQCow2.
/me

Den

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-15 14:28     ` Alberto Garcia
@ 2018-11-16 12:14       ` Max Reitz
  2018-11-16 15:03         ` Alberto Garcia
  0 siblings, 1 reply; 22+ messages in thread
From: Max Reitz @ 2018-11-16 12:14 UTC (permalink / raw)
  To: Alberto Garcia, Qemu-block
  Cc: qemu-devel, Markus Armbruster, Denis V. Lunev, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy

[-- Attachment #1: Type: text/plain, Size: 5649 bytes --]

On 15.11.18 15:28, Alberto Garcia wrote:
> On Wed 14 Nov 2018 06:24:10 PM CET, Max Reitz wrote:
>>>> Permission system
>>>> =================
>>>>
>>>> GRAPH_MOD
>>>> ---------
>>>>
>>>> We need some way for the commit job to prevent graph changes on its
>>>> chain while it is running.  Our current blocker doesn’t do the job,
>>>> however.  What to do?
>>>>
>>>> - We have no idea how to make a *permission* work.  Maybe the biggest
>>>>   problem is that it just doesn’t work as a permission, because the
>>>>   commit job doesn’t own the BdrvChildren that would need to be
>>>>   blocked (namely the @backing BdrvChild).
>>>
>>> What I'm doing at the moment in my blockdev-reopen series is check
>>> whether all parents of the node I want to change share the GRAPH_MOD
>>> permission. If any of them doesn't then the operation fails.
>>>
>>> This can be checked by calling bdrv_get_cumulative_perm() or simply
>>> bdrv_check_update_perm(..., BLK_PERM_GRAPH_MOD, ...).
>>
>> Sure.
>>
>>> It solves the problem because e.g. the stream block job only allows
>>> BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED, so no graph
>>> modifications allowed.
>>
>> But the problem is that the commit job needs to unshare the permission
>> on all backing links between base and top, so none of the backing
>> links can be broken.  But it cannot do that because those backing
>> links do not belong to it (but to the respective overlay nodes).
> 
> I'm not sure if I'm following you. The commit block job has references
> to all intermediate nodes (with block_job_add_bdrv()) and only shares
> BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE. Anyone trying to obtain
> BLK_PERM_GRAPH_MOD on an intermediate node would see that the block job
> doesn't allow it.

But first of all, why would anyone try to obtain it?  You only obtain a
permission when you attach a new parent (or when an existing parent
tries to change the ones it has).

For instance, in the commit case, if you just remove a backing link,
then the chain is clearly broken and the graph has been modified, but
noone will have taken any permission to do so.

But even beyond that, commit doesn't care about all of the graph.  All
it cares about are the backing links.  So it should not prevent anyone
else from changing any links outside of the backing chain (which a
permission would clearly do).

Let me give a concrete example:

top (T) -> intermediate (I) -> base (B)

Now you commit from top to base.  Clearly you don't want the backing
chain between top and base to change.  So say you unshare the GRAPH_MOD
permission (implying that whenever a parent doesn't care, it just shares
it).  But as said above, if someone just drops the backing link from I
to B, the permission system won't catch that.

The next thing is that how would you even try to catch it in the first
place?  The normal case would be for everything to both share and claim
the GRAPH_MOD permission, the latter so it can see if something unshared
it.  But that would mean that everything is holding that permission all
the time and it's impossible to unshare it.

(You'd take the GRAPH_MOD permission for the backing link from I to B,
and then commit couldn't unshare it because, well, it's taken already.)

So the solution might be to take it first, and release it after having
attached the child.  And maybe you'd say that for the removal case,
you'd have to take the GRAPH_MOD permission before you remove any child.
 But I think that's weird, and also, there's another issue still.


What if you try to attach a read-only NBD server to B?  That should be
OK in my mind.  But it is a graph change, so it wouldn't be allowed if
GRAPH_MOD is a permission, even though commit doesn't care at all.

The fact is that commit clearly only cares about edges it is not
involved in as either parent or child (the backing links), and therefore
has not control over the permissions taken.  It does not care about any
other edges, which is different from what permissions do, because they
always apply to a whole set of edges.


I don't think anything needs a way to generally block graph changes
around some node.  We only need to prevent changes to very specific sets
of edges.  This is something that the permission system just cannot do.

>>>>   (We never quite knew what “taking the GRAPH_PERMISSION” or
>>>>   “unsharing the GRPAH_MOD permission” was supposed to mean.
>>>>   Figuring that out always took like half an our in any face-to-face
>>>>   meeting, and then we decided it was pretty much useless for any
>>>>   case we had at hand.)
>>>
>>> Yeah it's a bit unclear. It can mean "none of this node's children
>>> can be replaced / removed", which is what I'm using it for at the
>>> moment.
>>
>> Yes, but that's just not useful.  I don't think we have any case where
>> something would be annoyed by having its immediate child replaced
>> (other than the visible data changing, potentially).  Usually when we
>> say something (e.g. a block job) wants to prevent graph modification,
>> it's about changing edges that exist independently of the job (such as
>> a backing chain).
> 
> Isn't that what I just said? You cannot change other edges <=> you
> cannot replace a node's children (or parents).

Hm, true, but I think it's (1) hard to understand, (2) would require
non-trivial changes to the current model anyway (take GRAPH_MOD before
any modification, release it afterwards (which is quite different from
any other permission anyway)), and (3) would be too restrictive.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-16 12:14       ` Max Reitz
@ 2018-11-16 15:03         ` Alberto Garcia
  2018-11-16 15:18           ` Kevin Wolf
                             ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Alberto Garcia @ 2018-11-16 15:03 UTC (permalink / raw)
  To: Max Reitz, Qemu-block
  Cc: qemu-devel, Markus Armbruster, Denis V. Lunev, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy

On Fri 16 Nov 2018 01:14:37 PM CET, Max Reitz wrote:
>>>>> Permission system
>>>>> =================
>>>>>
>>>>> GRAPH_MOD
>>>>> ---------
>>>>>
>>>>> We need some way for the commit job to prevent graph changes on its
>>>>> chain while it is running.  Our current blocker doesn’t do the job,
>>>>> however.  What to do?
>>>>>
>>>>> - We have no idea how to make a *permission* work.  Maybe the biggest
>>>>>   problem is that it just doesn’t work as a permission, because the
>>>>>   commit job doesn’t own the BdrvChildren that would need to be
>>>>>   blocked (namely the @backing BdrvChild).
>>>>
>>>> What I'm doing at the moment in my blockdev-reopen series is check
>>>> whether all parents of the node I want to change share the GRAPH_MOD
>>>> permission. If any of them doesn't then the operation fails.
>>>>
>>>> This can be checked by calling bdrv_get_cumulative_perm() or simply
>>>> bdrv_check_update_perm(..., BLK_PERM_GRAPH_MOD, ...).
>>>
>>> Sure.
>>>
>>>> It solves the problem because e.g. the stream block job only allows
>>>> BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED, so no graph
>>>> modifications allowed.
>>>
>>> But the problem is that the commit job needs to unshare the permission
>>> on all backing links between base and top, so none of the backing
>>> links can be broken.  But it cannot do that because those backing
>>> links do not belong to it (but to the respective overlay nodes).
>> 
>> I'm not sure if I'm following you. The commit block job has references
>> to all intermediate nodes (with block_job_add_bdrv()) and only shares
>> BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE. Anyone trying to obtain
>> BLK_PERM_GRAPH_MOD on an intermediate node would see that the block job
>> doesn't allow it.
>
> But first of all, why would anyone try to obtain it?  You only obtain
> a permission when you attach a new parent (or when an existing parent
> tries to change the ones it has).

You don't need to obtain it, you only need to see that the permission is
shared (with bdrv_get_cumulative_perm()). Or, as you suggest if I got it
right, attach a new "reopen" parent during bdrv_reopen_prepare() and
release it on commit / abort.

> For instance, in the commit case, if you just remove a backing link,
> then the chain is clearly broken and the graph has been modified, but
> noone will have taken any permission to do so.

If you do what I said in my previous paragraph then you can't remove the
backing link (this is one of my test cases).

> But even beyond that, commit doesn't care about all of the graph.  All
> it cares about are the backing links.  So it should not prevent anyone
> else from changing any links outside of the backing chain (which a
> permission would clearly do).
>
> Let me give a concrete example:
>
> top (T) -> intermediate (I) -> base (B)
>
> Now you commit from top to base.  Clearly you don't want the backing
> chain between top and base to change.  So say you unshare the GRAPH_MOD
> permission (implying that whenever a parent doesn't care, it just shares
> it).  But as said above, if someone just drops the backing link from I
> to B, the permission system won't catch that.

Yes it will, because the block job (either mirror or commit depending on
the case) is the parent of all three nodes (BlockJob.nodes is where the
list is stored) and does not alow GRAPH_MOD in any of them[*]. So
dropping that backing link fails (I also have a test case for that).

> The next thing is that how would you even try to catch it in the first
> place?  The normal case would be for everything to both share and
> claim the GRAPH_MOD permission, the latter so it can see if something
> unshared it.  But that would mean that everything is holding that
> permission all the time and it's impossible to unshare it.

As things are now, GRAPH_MOD is shared by default on backing files
(bdrv_format_default_perms()), but I don't understand why you need that
everything takes GRAPH_MOD, you only need it when you're actually
planning to change the graph (e.g a block job).

> So the solution might be to take it first, and release it after having
> attached the child.  And maybe you'd say that for the removal case,
> you'd have to take the GRAPH_MOD permission before you remove any
> child.

Yes, something like that.

> But I think that's weird, and also, there's another issue still.
>
> What if you try to attach a read-only NBD server to B?  That should be
> OK in my mind.  But it is a graph change, so it wouldn't be allowed if
> GRAPH_MOD is a permission, even though commit doesn't care at all.

I haven't examined that case (I'm not familiar with NBD), but if the
only problem is that the permission system is unnecessarily strict in
certain corner cases then maybe it's reasonable compromise _if_ there's
no simpler alternative (I'm not saying that there isn't!).

> I don't think anything needs a way to generally block graph changes
> around some node.  We only need to prevent changes to very specific
> sets of edges.  This is something that the permission system just
> cannot do.

But what would you do then?

>>>>>   (We never quite knew what “taking the GRAPH_PERMISSION” or
>>>>>   “unsharing the GRPAH_MOD permission” was supposed to mean.
>>>>>   Figuring that out always took like half an our in any
>>>>>   face-to-face meeting, and then we decided it was pretty much
>>>>>   useless for any case we had at hand.)
>>>>
>>>> Yeah it's a bit unclear. It can mean "none of this node's children
>>>> can be replaced / removed", which is what I'm using it for at the
>>>> moment.
>>>
>>> Yes, but that's just not useful.  I don't think we have any case where
>>> something would be annoyed by having its immediate child replaced
>>> (other than the visible data changing, potentially).  Usually when we
>>> say something (e.g. a block job) wants to prevent graph modification,
>>> it's about changing edges that exist independently of the job (such as
>>> a backing chain).
>> 
>> Isn't that what I just said? You cannot change other edges <=> you
>> cannot replace a node's children (or parents).
>
> Hm, true, but I think it's (1) hard to understand, (2) would require
> non-trivial changes to the current model anyway (take GRAPH_MOD before
> any modification, release it afterwards (which is quite different from
> any other permission anyway)), and (3) would be too restrictive.

I don't understand (2), block-stream is already doing that (takes
GRAPH_MOD when you create the block job, releases when the job is
completed).

Berto

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-16 15:03         ` Alberto Garcia
@ 2018-11-16 15:18           ` Kevin Wolf
  2018-11-16 15:27             ` Max Reitz
  2018-11-19 16:47             ` Alberto Garcia
  2018-11-16 15:19           ` Max Reitz
  2018-11-16 15:20           ` Alberto Garcia
  2 siblings, 2 replies; 22+ messages in thread
From: Kevin Wolf @ 2018-11-16 15:18 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Max Reitz, Qemu-block, qemu-devel, Markus Armbruster,
	Denis V. Lunev, Vladimir Sementsov-Ogievskiy

Am 16.11.2018 um 16:03 hat Alberto Garcia geschrieben:
> > I don't think anything needs a way to generally block graph changes
> > around some node.  We only need to prevent changes to very specific
> > sets of edges.  This is something that the permission system just
> > cannot do.
> 
> But what would you do then?

I agree with you mostly in that I think that most problems that Max
mentioned aren't readl. The only real problem I see with GRAPH_MOD as a
permission on the node level is this overblocking - but that's bad
enough that I feel using permissions to block changes to a whole node is
not a good solution.

So what's the alternative? Max had a possible solution in the first
email in this thread:

> - A property of BdrvChild that can be set by a non-parent seems more
>   feasible, e.g. a counter where changing the child is possible only
>   if the counter is 0.  This also actually makes sense in what it
>   means.

The commit job would increment BdrvChild.block_change (or whatever we
would call it) for all bs->backing edges in the subchain.

Kevin

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-16 15:03         ` Alberto Garcia
  2018-11-16 15:18           ` Kevin Wolf
@ 2018-11-16 15:19           ` Max Reitz
  2018-11-16 15:20           ` Alberto Garcia
  2 siblings, 0 replies; 22+ messages in thread
From: Max Reitz @ 2018-11-16 15:19 UTC (permalink / raw)
  To: Alberto Garcia, Qemu-block
  Cc: qemu-devel, Markus Armbruster, Denis V. Lunev, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy

[-- Attachment #1: Type: text/plain, Size: 9108 bytes --]

On 16.11.18 16:03, Alberto Garcia wrote:
> On Fri 16 Nov 2018 01:14:37 PM CET, Max Reitz wrote:
>>>>>> Permission system
>>>>>> =================
>>>>>>
>>>>>> GRAPH_MOD
>>>>>> ---------
>>>>>>
>>>>>> We need some way for the commit job to prevent graph changes on its
>>>>>> chain while it is running.  Our current blocker doesn’t do the job,
>>>>>> however.  What to do?
>>>>>>
>>>>>> - We have no idea how to make a *permission* work.  Maybe the biggest
>>>>>>   problem is that it just doesn’t work as a permission, because the
>>>>>>   commit job doesn’t own the BdrvChildren that would need to be
>>>>>>   blocked (namely the @backing BdrvChild).
>>>>>
>>>>> What I'm doing at the moment in my blockdev-reopen series is check
>>>>> whether all parents of the node I want to change share the GRAPH_MOD
>>>>> permission. If any of them doesn't then the operation fails.
>>>>>
>>>>> This can be checked by calling bdrv_get_cumulative_perm() or simply
>>>>> bdrv_check_update_perm(..., BLK_PERM_GRAPH_MOD, ...).
>>>>
>>>> Sure.
>>>>
>>>>> It solves the problem because e.g. the stream block job only allows
>>>>> BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED, so no graph
>>>>> modifications allowed.
>>>>
>>>> But the problem is that the commit job needs to unshare the permission
>>>> on all backing links between base and top, so none of the backing
>>>> links can be broken.  But it cannot do that because those backing
>>>> links do not belong to it (but to the respective overlay nodes).
>>>
>>> I'm not sure if I'm following you. The commit block job has references
>>> to all intermediate nodes (with block_job_add_bdrv()) and only shares
>>> BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE. Anyone trying to obtain
>>> BLK_PERM_GRAPH_MOD on an intermediate node would see that the block job
>>> doesn't allow it.
>>
>> But first of all, why would anyone try to obtain it?  You only obtain
>> a permission when you attach a new parent (or when an existing parent
>> tries to change the ones it has).
> 
> You don't need to obtain it, you only need to see that the permission is
> shared (with bdrv_get_cumulative_perm()). Or, as you suggest if I got it
> right, attach a new "reopen" parent during bdrv_reopen_prepare() and
> release it on commit / abort.

OK, but that's just completely different from how all other permissions
work.

>> For instance, in the commit case, if you just remove a backing link,
>> then the chain is clearly broken and the graph has been modified, but
>> noone will have taken any permission to do so.
> 
> If you do what I said in my previous paragraph then you can't remove the
> backing link (this is one of my test cases).
> 
>> But even beyond that, commit doesn't care about all of the graph.  All
>> it cares about are the backing links.  So it should not prevent anyone
>> else from changing any links outside of the backing chain (which a
>> permission would clearly do).
>>
>> Let me give a concrete example:
>>
>> top (T) -> intermediate (I) -> base (B)
>>
>> Now you commit from top to base.  Clearly you don't want the backing
>> chain between top and base to change.  So say you unshare the GRAPH_MOD
>> permission (implying that whenever a parent doesn't care, it just shares
>> it).  But as said above, if someone just drops the backing link from I
>> to B, the permission system won't catch that.
> 
> Yes it will, because the block job (either mirror or commit depending on
> the case) is the parent of all three nodes (BlockJob.nodes is where the
> list is stored) and does not alow GRAPH_MOD in any of them[*]. So
> dropping that backing link fails (I also have a test case for that).
> 
>> The next thing is that how would you even try to catch it in the first
>> place?  The normal case would be for everything to both share and
>> claim the GRAPH_MOD permission, the latter so it can see if something
>> unshared it.  But that would mean that everything is holding that
>> permission all the time and it's impossible to unshare it.
> 
> As things are now, GRAPH_MOD is shared by default on backing files
> (bdrv_format_default_perms()), but I don't understand why you need that
> everything takes GRAPH_MOD, you only need it when you're actually
> planning to change the graph (e.g a block job).
> 
>> So the solution might be to take it first, and release it after having
>> attached the child.  And maybe you'd say that for the removal case,
>> you'd have to take the GRAPH_MOD permission before you remove any
>> child.
> 
> Yes, something like that.
> 
>> But I think that's weird, and also, there's another issue still.
>>
>> What if you try to attach a read-only NBD server to B?  That should be
>> OK in my mind.  But it is a graph change, so it wouldn't be allowed if
>> GRAPH_MOD is a permission, even though commit doesn't care at all.
> 
> I haven't examined that case (I'm not familiar with NBD),

It doesn't matter whether it's NBD.  What matters is that commit doesn't
care about non-backing relationships.

> but if the
> only problem is that the permission system is unnecessarily strict in
> certain corner cases then maybe it's reasonable compromise _if_ there's
> no simpler alternative (I'm not saying that there isn't!).

But I'm very much saying there is.  GRAPH_MOD is complicated to
understand; to make it work, it would have to work differently from all
other permissions; and it isn't even what we want.

>> I don't think anything needs a way to generally block graph changes
>> around some node.  We only need to prevent changes to very specific
>> sets of edges.  This is something that the permission system just
>> cannot do.
> 
> But what would you do then?

What I wrote in the notes.  Have e.g. a counter in every BdrvChild that
can be incremented by everyone that means "If this is non-zero, do not
modify this BdrvChild".

At the start of the job, commit would go through the backing chain and
increment this counter on every backing link; and it would decrement it
after it's done.

>>>>>>   (We never quite knew what “taking the GRAPH_PERMISSION” or
>>>>>>   “unsharing the GRPAH_MOD permission” was supposed to mean.
>>>>>>   Figuring that out always took like half an our in any
>>>>>>   face-to-face meeting, and then we decided it was pretty much
>>>>>>   useless for any case we had at hand.)
>>>>>
>>>>> Yeah it's a bit unclear. It can mean "none of this node's children
>>>>> can be replaced / removed", which is what I'm using it for at the
>>>>> moment.
>>>>
>>>> Yes, but that's just not useful.  I don't think we have any case where
>>>> something would be annoyed by having its immediate child replaced
>>>> (other than the visible data changing, potentially).  Usually when we
>>>> say something (e.g. a block job) wants to prevent graph modification,
>>>> it's about changing edges that exist independently of the job (such as
>>>> a backing chain).
>>>
>>> Isn't that what I just said? You cannot change other edges <=> you
>>> cannot replace a node's children (or parents).
>>
>> Hm, true, but I think it's (1) hard to understand, (2) would require
>> non-trivial changes to the current model anyway (take GRAPH_MOD before
>> any modification, release it afterwards (which is quite different from
>> any other permission anyway)), and (3) would be too restrictive.
> 
> I don't understand (2), block-stream is already doing that (takes
> GRAPH_MOD when you create the block job, releases when the job is
> completed).

And isn't that just wrong?

You don't have to take it if you want to prevent modifications, you'd
have to unshare it.  Well, it does that, too, but there is no reason for
it to take that permission for a prolonged period of time.  You seem to
agree with me that you'd only take GRAPH_MOD while you actually do some
modification.  I suppose the idea may be that it takes the permission at
the start of the job so it can definitely successfully perform the
change at the end, because it has taken the permission already.  But
that's just wrong, because it takes the permission on the root node,
which is completely unaffected by stream's graph modifications anyway.

What is correct is that it doesn't share GRAPH_MOD for the intermediate
nodes it attaches to the block job.  But that probably doesn't do
anything because nothing cares to set/query/whatever GRAPH_MOD when
modifying the graph.


You seem to be implying that GRAPH_MOD works perfectly well.  But I
think it just does not.  I think most parts of the block layer
completely ignore it, and the rest is torn up in confusion what it's
even supposed to mean.

So even if we decided that GRAPH_MOD is good enough, there'd still be a
ton of work to do.  I very much think it'd be much more work (and not
just boring, tedious, but brain-mangling work that brings you to the
brink of a nervous breakdown) than just adding a BdrvChild.freeze
counter that actually makes sense and actually does what we need.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-16 15:03         ` Alberto Garcia
  2018-11-16 15:18           ` Kevin Wolf
  2018-11-16 15:19           ` Max Reitz
@ 2018-11-16 15:20           ` Alberto Garcia
  2 siblings, 0 replies; 22+ messages in thread
From: Alberto Garcia @ 2018-11-16 15:20 UTC (permalink / raw)
  To: Max Reitz, Qemu-block
  Cc: qemu-devel, Markus Armbruster, Denis V. Lunev, Kevin Wolf,
	Vladimir Sementsov-Ogievskiy

On Fri 16 Nov 2018 04:03:12 PM CET, Alberto Garcia wrote:
>> top (T) -> intermediate (I) -> base (B)
>>
>> Now you commit from top to base.  Clearly you don't want the backing
>> chain between top and base to change.  So say you unshare the GRAPH_MOD
>> permission (implying that whenever a parent doesn't care, it just shares
>> it).  But as said above, if someone just drops the backing link from I
>> to B, the permission system won't catch that.
>
> Yes it will, because the block job (either mirror or commit depending on
> the case) is the parent of all three nodes (BlockJob.nodes is where the
> list is stored) and does not alow GRAPH_MOD in any of them[*]. So
> dropping that backing link fails (I also have a test case for that).

I forgot to add the footnote [*] in my previous e-mail:

The mirror block job (in particular when in "commit-active" mode, but
also in other cases) creates a "mirror_top" node and then blocks the
target and all intermediate nodes by calling block_job_add_bdrv() in
each one of them:

   mirror_top (M) -> top (T) -> intermediate (I) -> base (B)

The problem is that those intermediate nodes start from top's
backing_bs(), so top itself is not added to the block job's list of
children and doesn't have its permissions restricted during the job's
lifetime.

This looks like a bug, I'll see if there's a way to reproduce and I'll
prepare a patch.

Berto

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-16 15:18           ` Kevin Wolf
@ 2018-11-16 15:27             ` Max Reitz
  2018-11-16 15:51               ` Kevin Wolf
  2018-11-19 16:47             ` Alberto Garcia
  1 sibling, 1 reply; 22+ messages in thread
From: Max Reitz @ 2018-11-16 15:27 UTC (permalink / raw)
  To: Kevin Wolf, Alberto Garcia
  Cc: Qemu-block, qemu-devel, Markus Armbruster, Denis V. Lunev,
	Vladimir Sementsov-Ogievskiy

[-- Attachment #1: Type: text/plain, Size: 1341 bytes --]

On 16.11.18 16:18, Kevin Wolf wrote:
> Am 16.11.2018 um 16:03 hat Alberto Garcia geschrieben:
>>> I don't think anything needs a way to generally block graph changes
>>> around some node.  We only need to prevent changes to very specific
>>> sets of edges.  This is something that the permission system just
>>> cannot do.
>>
>> But what would you do then?
> 
> I agree with you mostly in that I think that most problems that Max
> mentioned aren't readl. The only real problem I see with GRAPH_MOD as a
> permission on the node level is this overblocking

I wholeheartedly disagree.  Yes, it is true that most of the issues I
thought of can be fixed, and so those problems are not problems in a
technical sense.  But to me this whole discussion points to the greatest
issue I have, which is that GRAPH_MOD is just too complicated to
understand.  And I don't like a solution that works on a technical level
but that everybody is too afraid to touch because it's too weird.

We have this discussion again and again, and in the end we always come
up with something that looks like it might work, but it's just so weird
that we can't even remember it.

Maybe it's just me, though.  Frankly, I think the permission system
itself is already too complicated as it is, but I don't have a simpler
solution there.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-16 15:27             ` Max Reitz
@ 2018-11-16 15:51               ` Kevin Wolf
  2018-11-16 16:34                 ` Max Reitz
  0 siblings, 1 reply; 22+ messages in thread
From: Kevin Wolf @ 2018-11-16 15:51 UTC (permalink / raw)
  To: Max Reitz
  Cc: Alberto Garcia, Qemu-block, qemu-devel, Markus Armbruster,
	Denis V. Lunev, Vladimir Sementsov-Ogievskiy

[-- Attachment #1: Type: text/plain, Size: 2231 bytes --]

Am 16.11.2018 um 16:27 hat Max Reitz geschrieben:
> On 16.11.18 16:18, Kevin Wolf wrote:
> > Am 16.11.2018 um 16:03 hat Alberto Garcia geschrieben:
> >>> I don't think anything needs a way to generally block graph changes
> >>> around some node.  We only need to prevent changes to very specific
> >>> sets of edges.  This is something that the permission system just
> >>> cannot do.
> >>
> >> But what would you do then?
> > 
> > I agree with you mostly in that I think that most problems that Max
> > mentioned aren't readl. The only real problem I see with GRAPH_MOD as a
> > permission on the node level is this overblocking
> 
> I wholeheartedly disagree.  Yes, it is true that most of the issues I
> thought of can be fixed, and so those problems are not problems in a
> technical sense.  But to me this whole discussion points to the greatest
> issue I have, which is that GRAPH_MOD is just too complicated to
> understand.  And I don't like a solution that works on a technical level
> but that everybody is too afraid to touch because it's too weird.

GRAPH_MOD with the meaning that Berto suggested isn't weird or
complicated to understand. It's only the wrong tool because it blocks
more than we want to block. But if we didn't care about that, it could
be just another permission like any other.

If you want to change the graph, you'd need to get the permission first,
and bdrv_replace_child_noperm() could assert that at least one parent of
the parent node has acquired the permission (unless you want to pass the
exact parent BdrvChild to it; maybe this is what we would really do
then).

> We have this discussion again and again, and in the end we always come
> up with something that looks like it might work, but it's just so weird
> that we can't even remember it.

I don't think we ever come up with something, weird or not, that
achieves what we wanted to achieve - because the problem simply can't be
solved properly at the node level.

> Maybe it's just me, though.  Frankly, I think the permission system
> itself is already too complicated as it is, but I don't have a simpler
> solution there.

It doesn't feel too bad to me, but that's subjective.

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-16 15:51               ` Kevin Wolf
@ 2018-11-16 16:34                 ` Max Reitz
  2018-11-16 17:13                   ` Kevin Wolf
  2018-11-16 17:16                   ` Alberto Garcia
  0 siblings, 2 replies; 22+ messages in thread
From: Max Reitz @ 2018-11-16 16:34 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Alberto Garcia, Qemu-block, qemu-devel, Markus Armbruster,
	Denis V. Lunev, Vladimir Sementsov-Ogievskiy

[-- Attachment #1: Type: text/plain, Size: 5949 bytes --]

On 16.11.18 16:51, Kevin Wolf wrote:
> Am 16.11.2018 um 16:27 hat Max Reitz geschrieben:
>> On 16.11.18 16:18, Kevin Wolf wrote:
>>> Am 16.11.2018 um 16:03 hat Alberto Garcia geschrieben:
>>>>> I don't think anything needs a way to generally block graph changes
>>>>> around some node.  We only need to prevent changes to very specific
>>>>> sets of edges.  This is something that the permission system just
>>>>> cannot do.
>>>>
>>>> But what would you do then?
>>>
>>> I agree with you mostly in that I think that most problems that Max
>>> mentioned aren't readl. The only real problem I see with GRAPH_MOD as a
>>> permission on the node level is this overblocking
>>
>> I wholeheartedly disagree.  Yes, it is true that most of the issues I
>> thought of can be fixed, and so those problems are not problems in a
>> technical sense.  But to me this whole discussion points to the greatest
>> issue I have, which is that GRAPH_MOD is just too complicated to
>> understand.  And I don't like a solution that works on a technical level
>> but that everybody is too afraid to touch because it's too weird.
> 
> GRAPH_MOD with the meaning that Berto suggested isn't weird or
> complicated to understand. It's only the wrong tool because it blocks
> more than we want to block. But if we didn't care about that, it could
> be just another permission like any other.

The meaning Berto has suggested (AFAIU) is never taking the permission
at all but just querying whether it is shared by all current parents
before doing a graph change.

That is very weird to me because permissions are there to be taken.


The meaning I have suggested is that it would be taken before a graph
change and released right afterwards.  I think this is still weird
because we don't take, say, the WRITE permission before every
bdrv*_write*() call and release it afterwards.

Sure you could say "actually, why not".  Please don't.[1]

> If you want to change the graph, you'd need to get the permission first,
> and bdrv_replace_child_noperm() could assert that at least one parent of
> the parent node has acquired the permission (unless you want to pass the
> exact parent BdrvChild to it; maybe this is what we would really do
> then).

Yep, that assertion would be like we do it in bdrv_co_write_req_prepare().

Thing is, you (as in "the caller that actually wants to do something",
like mirror) don't just "get the permission".  Permissions are
associated with children, so you cannot get this permission before you
create your child.

So this is where "parent of a parent node" comes in?  I'm afraid I don't
understand that.

What would make sense to me would be for the generic block layer to take
this permission on new children (and drop it immediately after having
attached the child) and children that are about to be detached.  Which
again is just different from any other permission, because things
outside of block.c can never take it, they can only choose not to share
it.  So it'd be an internal permission, which is visible to the outside,
and I think this is at least weirder than just having an internal
counter with a clear external interface (inc/dec).


Also, please excuse me for it being Friday and all, but I don't quite
believe in "It's actually clear, I have no idea why you claim to always
take so long to understand it."  The code clearly has no idea what it's
supposed to do, and I do not believe that the sole reason for that is
that someone who did understand didn't have the time or the need to fix
it, or to implement it correctly in the first place.

On the other hand it sounded a bit like Berto was about to fix it,
though, so there's that.

Or I just got you wrong and this is just the usual case of "Now that
we've talked about this through a couple of lengthy emails, it does make
sense" which is just all too familiar and exactly my issue with the
whole thing.

>> We have this discussion again and again, and in the end we always come
>> up with something that looks like it might work, but it's just so weird
>> that we can't even remember it.
> 
> I don't think we ever come up with something, weird or not, that
> achieves what we wanted to achieve - because the problem simply can't be
> solved properly at the node level.

Yes, I've phrased my disagreement wrong.  Of course I don't disagree
with you on that.  I just disagree that this is the only issue with
GRAPH_MOD.

>> Maybe it's just me, though.  Frankly, I think the permission system
>> itself is already too complicated as it is, but I don't have a simpler
>> solution there.
> 
> It doesn't feel too bad to me, but that's subjective.

It's not worse than quiescing/draining, that's for sure.

Max


[1] Please don't because on one hand I wouldn't have a counter-argument
for this.  Yes, maybe we should actually take permissions right before
the operations that require them.  Maybe keeping them around for as long
as a child stays attached to a parent is just being lazy.

I suppose the main argument against taking permissions just when you
need them is that this introduces additional points of failure that are
mostly unnecessary.  For GRAPH_MOD, however, these points would be
limited and easily handled, because they just involve making
bdrv_attach_child() fail.

So, no, I do not have a good technical counter-argument.  But my
argument of complexity still stands after having gone through
understanding how GRAPH_MOD might actually work for at least the third
time now, without it ever having made the next time any easier.

Even if GRAPH_MOD were correctly implemented tomorrow, in a way
conforming to what we've discussed now, I have the strong feeling that I
still would get to a point where I forgot its meaning and would have to
go through all of this again.

And that's completely disregarding people new to the codebase.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-16 16:34                 ` Max Reitz
@ 2018-11-16 17:13                   ` Kevin Wolf
  2018-11-16 18:23                     ` Max Reitz
  2018-11-16 17:16                   ` Alberto Garcia
  1 sibling, 1 reply; 22+ messages in thread
From: Kevin Wolf @ 2018-11-16 17:13 UTC (permalink / raw)
  To: Max Reitz
  Cc: Alberto Garcia, Qemu-block, qemu-devel, Markus Armbruster,
	Denis V. Lunev, Vladimir Sementsov-Ogievskiy

[-- Attachment #1: Type: text/plain, Size: 9810 bytes --]

Am 16.11.2018 um 17:34 hat Max Reitz geschrieben:
> On 16.11.18 16:51, Kevin Wolf wrote:
> > Am 16.11.2018 um 16:27 hat Max Reitz geschrieben:
> >> On 16.11.18 16:18, Kevin Wolf wrote:
> >>> Am 16.11.2018 um 16:03 hat Alberto Garcia geschrieben:
> >>>>> I don't think anything needs a way to generally block graph changes
> >>>>> around some node.  We only need to prevent changes to very specific
> >>>>> sets of edges.  This is something that the permission system just
> >>>>> cannot do.
> >>>>
> >>>> But what would you do then?
> >>>
> >>> I agree with you mostly in that I think that most problems that Max
> >>> mentioned aren't readl. The only real problem I see with GRAPH_MOD as a
> >>> permission on the node level is this overblocking
> >>
> >> I wholeheartedly disagree.  Yes, it is true that most of the issues I
> >> thought of can be fixed, and so those problems are not problems in a
> >> technical sense.  But to me this whole discussion points to the greatest
> >> issue I have, which is that GRAPH_MOD is just too complicated to
> >> understand.  And I don't like a solution that works on a technical level
> >> but that everybody is too afraid to touch because it's too weird.
> > 
> > GRAPH_MOD with the meaning that Berto suggested isn't weird or
> > complicated to understand. It's only the wrong tool because it blocks
> > more than we want to block. But if we didn't care about that, it could
> > be just another permission like any other.
> 
> The meaning Berto has suggested (AFAIU) is never taking the permission
> at all but just querying whether it is shared by all current parents
> before doing a graph change.
> 
> That is very weird to me because permissions are there to be taken.

It would be a shortcut for taking the permission temporarily, which
would possibly work inside QEMU because we're holding the BQL, though
I'm not sure whether I/O threads couldn't influence this. Anyway, with
our mapping of permissions to filesystem locks, it's definitely not
correct any more and you do need to actually temporarily acquire the
locks.

> The meaning I have suggested is that it would be taken before a graph
> change and released right afterwards.  I think this is still weird
> because we don't take, say, the WRITE permission before every
> bdrv*_write*() call and release it afterwards.
> 
> Sure you could say "actually, why not".  Please don't.[1]

Actually, why not? Not in the sense that it would be the right model for
WRITE, but it might still be the right one for GRAPH_MOD in most cases.

We do take the RESIZE permission in qmp_block_resize() before calling
blk_truncate(), and drop it right afterwards (by means of creating a
temporary BlockBackend only for the purpose of resizing), so there is
precedence.

Of course, all of this is moot because GRAPH_MOD can't provide the exact
semantics we want to have.

> > If you want to change the graph, you'd need to get the permission first,
> > and bdrv_replace_child_noperm() could assert that at least one parent of
> > the parent node has acquired the permission (unless you want to pass the
> > exact parent BdrvChild to it; maybe this is what we would really do
> > then).
> 
> Yep, that assertion would be like we do it in bdrv_co_write_req_prepare().
> 
> Thing is, you (as in "the caller that actually wants to do something",
> like mirror) don't just "get the permission".  Permissions are
> associated with children, so you cannot get this permission before you
> create your child.
> 
> So this is where "parent of a parent node" comes in?  I'm afraid I don't
> understand that.

Including GRAPH_MOD in the permission system means that it controls an
operation that reaches a node through a BdrvChild. In this case, the
operation is "modifiying a BdrvChild of the node that the permission
holding BdrvChild points to".

Okay, if you say that affecting children of a node instead of just the
node itself is a bit unusual, I might have to agree.

> What would make sense to me would be for the generic block layer to take
> this permission on new children (and drop it immediately after having
> attached the child) and children that are about to be detached.  Which
> again is just different from any other permission, because things
> outside of block.c can never take it, they can only choose not to share
> it.  So it'd be an internal permission, which is visible to the outside,
> and I think this is at least weirder than just having an internal
> counter with a clear external interface (inc/dec).

I don't think I'd want to implement it like that (but then, I'll repeat
that I wouldn't want to implement it at all because it's the wrong
tool).

> Also, please excuse me for it being Friday and all, but I don't quite
> believe in "It's actually clear, I have no idea why you claim to always
> take so long to understand it."  The code clearly has no idea what it's
> supposed to do, and I do not believe that the sole reason for that is
> that someone who did understand didn't have the time or the need to fix
> it, or to implement it correctly in the first place.
> 
> On the other hand it sounded a bit like Berto was about to fix it,
> though, so there's that.
> 
> Or I just got you wrong and this is just the usual case of "Now that
> we've talked about this through a couple of lengthy emails, it does make
> sense" which is just all too familiar and exactly my issue with the
> whole thing.

It's actually "now that we have a precise definition what GRAPH_MOD
means", at least for the sake of this discussion. It's not a useful
definition, so I don't think this is actually meaningful progress, but
having this definition, you can reason about it now. It's not very
complicated on the conceptual level (the implementation looks differernt
because we'd have to introduce another BdrvChild parameter everywhere to
know where the operation comes from), just not a very good tool either.

What we noted down about counters was more meaningful progress because
we think this can actually provide the semantics that we need. Of
course, there is still some hand-waving involved (we'll just check the
counter when modifying the graph), which might not be that trivial to
implement because bdrv_replace_child() can't return an error...

> >> We have this discussion again and again, and in the end we always come
> >> up with something that looks like it might work, but it's just so weird
> >> that we can't even remember it.
> > 
> > I don't think we ever come up with something, weird or not, that
> > achieves what we wanted to achieve - because the problem simply can't be
> > solved properly at the node level.
> 
> Yes, I've phrased my disagreement wrong.  Of course I don't disagree
> with you on that.  I just disagree that this is the only issue with
> GRAPH_MOD.
> 
> >> Maybe it's just me, though.  Frankly, I think the permission system
> >> itself is already too complicated as it is, but I don't have a simpler
> >> solution there.
> > 
> > It doesn't feel too bad to me, but that's subjective.
> 
> It's not worse than quiescing/draining, that's for sure.

How would _you_ know? :-)

(But I agree.)

> [1] Please don't because on one hand I wouldn't have a counter-argument
> for this.  Yes, maybe we should actually take permissions right before
> the operations that require them.  Maybe keeping them around for as long
> as a child stays attached to a parent is just being lazy.

No, it's the very mechanism that requires users throughout the stack to
declare what they want to do with their nodes. If you just automatically
got write permissions when you do a write request, you could attach
read-write devices to a read-only node and things would go up in flames
as soon as the device actually writes.

This would be consistent with GRAPH_MOD (or actually even the counters),
if block jobs requested this from the beginning and not only when they
complete.

For the implementation with counters, maybe we actually need two of them
like perm/shared in the permission system. One that means that we're
going to modify the graph, and the other one that means that we can't
tolerate graph modifications.

For the public interface, we might actually treat it mostly like
permissions, just edge permissions, not node permissions.

> I suppose the main argument against taking permissions just when you
> need them is that this introduces additional points of failure that are
> mostly unnecessary.  For GRAPH_MOD, however, these points would be
> limited and easily handled, because they just involve making
> bdrv_attach_child() fail.

You probably don't even want to start the whole graph modification
(which may well involve multiple changed children) when one of the
changes is going to fail. So I think taking the permission beforehand is
useful.

I'd consider taking it even at the job start, but the latest I would
take it is at the start of the .prepare callback.

(This is relevant for the counters, too.)

> So, no, I do not have a good technical counter-argument.  But my
> argument of complexity still stands after having gone through
> understanding how GRAPH_MOD might actually work for at least the third
> time now, without it ever having made the next time any easier.
> 
> Even if GRAPH_MOD were correctly implemented tomorrow, in a way
> conforming to what we've discussed now, I have the strong feeling that I
> still would get to a point where I forgot its meaning and would have to
> go through all of this again.
> 
> And that's completely disregarding people new to the codebase.

As I said, it's moot anyway. It doesn't provide the right semantics.

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-16 16:34                 ` Max Reitz
  2018-11-16 17:13                   ` Kevin Wolf
@ 2018-11-16 17:16                   ` Alberto Garcia
  1 sibling, 0 replies; 22+ messages in thread
From: Alberto Garcia @ 2018-11-16 17:16 UTC (permalink / raw)
  To: Max Reitz, Kevin Wolf
  Cc: Qemu-block, qemu-devel, Markus Armbruster, Denis V. Lunev,
	Vladimir Sementsov-Ogievskiy

(I just wanted to reply quickly to this point, I'll read the rest of the
e-mail later)

On Fri 16 Nov 2018 05:34:08 PM CET, Max Reitz <mreitz@redhat.com> wrote:
> > GRAPH_MOD with the meaning that Berto suggested isn't weird or
> > complicated to understand. It's only the wrong tool because it
> > blocks more than we want to block. But if we didn't care about that,
> > it could be just another permission like any other.

That's the way that I understood it. Or, more precisely, I'm basing my
current code on those assumptions because it's a simple starting point
to have something working with the current permission system. But I'm of
course open to trying a different approach.

> The meaning Berto has suggested (AFAIU) is never taking the permission
> at all but just querying whether it is shared by all current parents
> before doing a graph change.

That would be a quick way to test it, and I agree that it's unorthodox,
but it's useful to explain how I understood the meaning of GRAPH_MOD.

But you could for example make BDRVReopenState a parent of the BDS (and
replace BDRVReopenState.bs with a BdrvChild).

Berto

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-16 17:13                   ` Kevin Wolf
@ 2018-11-16 18:23                     ` Max Reitz
  0 siblings, 0 replies; 22+ messages in thread
From: Max Reitz @ 2018-11-16 18:23 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Alberto Garcia, Qemu-block, qemu-devel, Markus Armbruster,
	Denis V. Lunev, Vladimir Sementsov-Ogievskiy

[-- Attachment #1: Type: text/plain, Size: 3257 bytes --]

I feel bad for trimming your mail so much, but that's just because I've
read it and I agree.  So, it's trimming of the good kind.
(Then again, when is trimming of a long mail ever bad?)

On 16.11.18 18:13, Kevin Wolf wrote:

[...]

> What we noted down about counters was more meaningful progress because
> we think this can actually provide the semantics that we need. Of
> course, there is still some hand-waving involved (we'll just check the
> counter when modifying the graph), which might not be that trivial to
> implement because bdrv_replace_child() can't return an error...

Sure, there is much to talk about there.  You're right, we probably need
a transaction system there, too, which makes things not so simple again.

And then there's the fact that bdrv_detach_child() currently cannot
fail, and we really don't want it to fail in e.g. bdrv_close().
Expressing that may be tricky, too.

>>> It doesn't feel too bad to me, but that's subjective.
>>
>> It's not worse than quiescing/draining, that's for sure.
> 
> How would _you_ know? :-)

I seem to recall to at least have read your patches... ;-)

[...]

> For the implementation with counters, maybe we actually need two of them
> like perm/shared in the permission system. One that means that we're
> going to modify the graph, and the other one that means that we can't
> tolerate graph modifications.
> 
> For the public interface, we might actually treat it mostly like
> permissions, just edge permissions, not node permissions.

Hm.  I see your point.  But having a "perm" field which would allow a
party to be able to do graph modifications for as long as it has claimed
that field would mean that the probably also want other parties not to
modify that exact same part of the graph in the meantime.  Like commit
which will modify part of the graph while over the whole lifetime of the
job it doesn't want that part to be modified by something else.

But then again, exactly that "perm" field solves the issue, too.  If you
are required to take the permission (and in the process block other
modifications on that edge) before doing graph modifications, the check
function can simply see that there is exactly one blocker, and infer
that this must be the party doing the modification itself.

>> So, no, I do not have a good technical counter-argument.  But my
>> argument of complexity still stands after having gone through
>> understanding how GRAPH_MOD might actually work for at least the third
>> time now, without it ever having made the next time any easier.
>>
>> Even if GRAPH_MOD were correctly implemented tomorrow, in a way
>> conforming to what we've discussed now, I have the strong feeling that I
>> still would get to a point where I forgot its meaning and would have to
>> go through all of this again.
>>
>> And that's completely disregarding people new to the codebase.
> 
> As I said, it's moot anyway. It doesn't provide the right semantics.

True, but the discussion itself isn't moot to me.  I wanted to stress
that given something that is rather hard to grasp, I prefer to throw it
away as long as it is not functional yet, whenever a more intuitive
alternative presents itself.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] KVM Forum block no[td]es
  2018-11-16 15:18           ` Kevin Wolf
  2018-11-16 15:27             ` Max Reitz
@ 2018-11-19 16:47             ` Alberto Garcia
  1 sibling, 0 replies; 22+ messages in thread
From: Alberto Garcia @ 2018-11-19 16:47 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Max Reitz, Qemu-block, qemu-devel, Markus Armbruster,
	Denis V. Lunev, Vladimir Sementsov-Ogievskiy

On Fri 16 Nov 2018 04:18:34 PM CET, Kevin Wolf wrote:
> Am 16.11.2018 um 16:03 hat Alberto Garcia geschrieben:
>> > I don't think anything needs a way to generally block graph changes
>> > around some node.  We only need to prevent changes to very specific
>> > sets of edges.  This is something that the permission system just
>> > cannot do.
>> 
>> But what would you do then?
>
> I agree with you mostly in that I think that most problems that Max
> mentioned aren't readl. The only real problem I see with GRAPH_MOD as
> a permission on the node level is this overblocking - but that's bad
> enough that I feel using permissions to block changes to a whole node
> is not a good solution.

I understand that this restriction would be too coarse, but do you have
any case in mind where this would be a problem, or are you simply
worried because it's not an elegant solution?

(not that I'm resisting the proposal of using a counter, I'm just
curious)

> So what's the alternative? Max had a possible solution in the first
> email in this thread:
>
>> - A property of BdrvChild that can be set by a non-parent seems more
>>   feasible, e.g. a counter where changing the child is possible only
>>   if the counter is 0.  This also actually makes sense in what it
>>   means.
>
> The commit job would increment BdrvChild.block_change (or whatever we
> would call it) for all bs->backing edges in the subchain.

So instead of having a permission blocking changes in all of a node's
links, this would only affect the backing links.

It should work, I can give it a try.

Does it make sense to keep the GRAPH_MOD permission at all if we're not
really using it and we don't know what it is for? We can just drop it,
or are we breaking the API or the ABI?

Berto

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] [Qemu-block] KVM Forum block no[td]es
  2018-11-12 16:10     ` Nir Soffer
@ 2018-11-21  1:24       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2018-11-21  1:24 UTC (permalink / raw)
  To: Nir Soffer, Max Reitz
  Cc: qemu-block, Kevin Wolf, QEMU Developers, Markus Armbruster, Denis Lunev


On 12.11.2018 16:10, Nir Soffer wrote:
On Mon, Nov 12, 2018 at 5:26 PM Max Reitz <mreitz@redhat.com<mailto:mreitz@redhat.com>> wrote:
On 12.11.18 00:36, Nir Soffer wrote:
> On Mon, Nov 12, 2018 at 12:25 AM Max Reitz <mreitz@redhat.com<mailto:mreitz@redhat.com>
> <mailto:mreitz@redhat.com<mailto:mreitz@redhat.com>>> wrote:
>
>     This is what I’ve taken from two or three BoF-like get-togethers on
>     blocky things.  Amendments are more than welcome, of course.
>
> ...
>
>     Bitmaps
>
>     =======
>
>     (Got this section from sneaking into a BoF I wasn’t invited to.  Oh
>     well.  Won’t hurt to include them here.)
>
>     Currently, when dirty bitmaps are loaded, all IN_USE bitmaps are just
>     not loaded at all and completely ignored.  That isn’t correct, though,
>     they should either still be loaded (and automatically treated and
>     written back as fully dirty), or at least qemu-img check should
>     “repair” them (i.e. fully dirtying them).
>
>
> I'm not sure making bitmaps dirty is better.
>
> When bitmap is marked IN_USE, it means that we cannot use it for
> backup. Deleting the bitmap or making it as bad so it cannot be used
> for the next backup is the same. Making the bitmap as dirty will full
> the management layer that everything was fine when the next backup
> includes the entire disk. It is better to cause the next backup to fail
> in a verbose way, since the backup software can recover by doing
> a full backup.

But making the dirty bitmap fully dirty will automatically enforce a
full backup.

True, but is also hide the fact that we lost the bitmap. I think we should
start with the simple way of making it fail and let the rest of the stack
fallback to full backup.


I agree. Management tool should decide what to do if we loose a bitmap. Having absolutely dirty dirty-bitmap is actually useless, as it's equal to not having a bitmap.

Anyway, we can add some flags to qemu-img check, to fix IN_USE bitmaps somehow. I thing the most simple and useful option would be just remove IN_USE bitmaps.


>     Sometimes qemu (running in a mode as bare as possible) is better than
>     using qemu-img convert, for instance.  It gives you more control
>     (through QMP; you get e.g. better progress reporting), you get all of
>     the mirror optimizations (we do have optimizations for convert, too,
>     but whether it’s any good to write the same (or different?)
>     optimizations twice is another question), and you get a common
>     interface for everything (online and offline).
>     Note that besides a bare qemu we’ve also always wanted to convert as
>     many qemu-img operations into frontends for block jobs as possible.
>     We have only done this for commit, however, even though convert looked
>     like basically the ideal target.  It was just too hard with too little
>     apparent gain, like always (and convert supports additional features
>     like concatenation which we don’t have in the runtime block layer
>     yet).
>
>
> I'm not sure it is better to run qemu and use qemu-img as thin wrapper
> over qmp.
>
> For example, management system may ascociate qemu-img
> with a sanlock lease, and sanlock may try to terminate qemu-img when
> a lease is invalidated. In this case sanlock will succeed while qemu is till
> accessing storage it should not access.
> ...

The point was not to run two processes, but to link the necessary bits
of qemu into qemu-img and then use them inside the single qemu-img
process itself.  As hinted at above, we've been doing this for qemu-img
commit for quite some time; that command actually runs a block job under
the hood (inside of qemu-img).

Sounds great.


Original idea was not to use qemu-img at all, but only qemu..


Nir

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-11-21  1:24 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-11 22:25 [Qemu-devel] KVM Forum block no[td]es Max Reitz
2018-11-11 23:36 ` [Qemu-devel] [Qemu-block] " Nir Soffer
2018-11-12 15:25   ` Max Reitz
2018-11-12 16:10     ` Nir Soffer
2018-11-21  1:24       ` Vladimir Sementsov-Ogievskiy
2018-11-14 19:38     ` John Snow
2018-11-13 15:12 ` [Qemu-devel] " Alberto Garcia
2018-11-14 17:24   ` Max Reitz
2018-11-15 14:28     ` Alberto Garcia
2018-11-16 12:14       ` Max Reitz
2018-11-16 15:03         ` Alberto Garcia
2018-11-16 15:18           ` Kevin Wolf
2018-11-16 15:27             ` Max Reitz
2018-11-16 15:51               ` Kevin Wolf
2018-11-16 16:34                 ` Max Reitz
2018-11-16 17:13                   ` Kevin Wolf
2018-11-16 18:23                     ` Max Reitz
2018-11-16 17:16                   ` Alberto Garcia
2018-11-19 16:47             ` Alberto Garcia
2018-11-16 15:19           ` Max Reitz
2018-11-16 15:20           ` Alberto Garcia
2018-11-16  7:47 ` Denis V.Lunev

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.