On 06.02.2020 at 16:19, Max Reitz wrote:
> On 06.02.20 15:42, Kevin Wolf wrote:
> > On 06.02.2020 at 11:21, Max Reitz wrote:
> >> On 05.02.20 16:55, Kevin Wolf wrote:
> >>> On 11.11.2019 at 17:02, Max Reitz wrote:
> >>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> >>>> ---
> >>>>  block/quorum.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>  1 file changed, 62 insertions(+)
> >>>>
> >>>> diff --git a/block/quorum.c b/block/quorum.c
> >>>> index 3a824e77e3..8ee03e9baf 100644
> >>>> --- a/block/quorum.c
> >>>> +++ b/block/quorum.c
> >>>> @@ -825,6 +825,67 @@ static bool quorum_recurse_is_first_non_filter(BlockDriverState *bs,
> >>>>      return false;
> >>>>  }
> >>>>  
> >>>> +static bool quorum_recurse_can_replace(BlockDriverState *bs,
> >>>> +                                       BlockDriverState *to_replace)
> >>>> +{
> >>>> +    BDRVQuorumState *s = bs->opaque;
> >>>> +    int i;
> >>>> +
> >>>> +    for (i = 0; i < s->num_children; i++) {
> >>>> +        /*
> >>>> +         * We have no idea whether our children show the same data as
> >>>> +         * this node (@bs).  It is actually highly likely that
> >>>> +         * @to_replace does not, because replacing a broken child is
> >>>> +         * one of the main use cases here.
> >>>> +         *
> >>>> +         * We do know that the new BDS will match @bs, so replacing
> >>>> +         * any of our children by it will be safe.  It cannot change
> >>>> +         * the data this quorum node presents to its parents.
> >>>> +         *
> >>>> +         * However, replacing @to_replace by @bs in any of our
> >>>> +         * children's chains may change visible data somewhere in
> >>>> +         * there.  We therefore cannot recurse down those chains with
> >>>> +         * bdrv_recurse_can_replace().
> >>>> +         * (More formally, bdrv_recurse_can_replace() requires that
> >>>> +         * @to_replace will be replaced by something matching the @bs
> >>>> +         * passed to it.  We cannot guarantee that.)
> >>>> +         *
> >>>> +         * Thus, we can only check whether any of our immediate
> >>>> +         * children matches @to_replace.
> >>>> +         *
> >>>> +         * (In the future, we might add a function to recurse down a
> >>>> +         * chain that checks that nothing there cares about a change
> >>>> +         * in data from the respective child in question.  For
> >>>> +         * example, most filters do not care when their child's data
> >>>> +         * suddenly changes, as long as their parents do not care.)
> >>>> +         */
> >>>> +        if (s->children[i].child->bs == to_replace) {
> >>>> +            Error *local_err = NULL;
> >>>> +
> >>>> +            /*
> >>>> +             * We now have to ensure that there is no other parent
> >>>> +             * that cares about replacing this child by a node with
> >>>> +             * potentially different data.
> >>>> +             */
> >>>> +            s->children[i].to_be_replaced = true;
> >>>> +            bdrv_child_refresh_perms(bs, s->children[i].child, &local_err);
> >>>> +
> >>>> +            /* Revert permissions */
> >>>> +            s->children[i].to_be_replaced = false;
> >>>> +            bdrv_child_refresh_perms(bs, s->children[i].child, &error_abort);
> >>>
> >>> Quite a hack. The two obvious problems are:
> >>>
> >>> 1. We can't guarantee that we can actually revert the permissions. I
> >>>    think we ignore failures to loosen permissions these days, so that at
> >>>    least the &error_abort doesn't trigger, but bs could still be in the
> >>>    wrong state afterwards.
> >>
> >> I thought we guaranteed that loosening permissions never fails.
> >>
> >> (Well, you know.  It may “leak” permissions, but we’d never get an error
> >> here so there’s nothing to handle anyway.)
> > 
> > This is what I meant. We ignore the failure (i.e. don't return an error),
> > but the result still isn't completely correct ("leaked" permissions).
> > 
> >>>    It would be cleaner to use check+abort instead of actually setting
> >>>    the new permission.
> >>
> >> Oh.  Yes.  Maybe.  It does require more code, though, because I’d rather
> >> not use bdrv_check_update_perm() from here as-is.
> > 
> > I'm not saying you need to do it, just that it would be cleaner. :-)
> 
> It would.  Thanks for the suggestion, I obviously didn’t think of it.
> (Or there’d be a comment on how this is not the best way in theory, but
> in practice it’s good enough.)  I suppose I’ll see what I can do.
> 
> >>> 2. As aborting the permission change makes more obvious, we're checking
> >>>    something that might not be true any more when we actually make the
> >>>    change.
> >>
> >> True.  I tried to do it right by having a post-replace cleanup function,
> >> but after a while that was just going nowhere, really.  So I just went
> >> with what’s patch 13 here.
> >>
> >> But isn’t 13 enough, actually?  It checks can_replace right before
> >> replacing in a drained section.  I can’t imagine the permissions
> >> changing there.
> > 
> > Permissions are tied to file locks, so an external process can just grab
> > the locks in between.
> 
> Ah, right, I didn’t think of that.
> 
> > But if I understand correctly, all we try here is
> > to have an additional safeguard to prevent the user from doing stupid
> > things. So I guess not being 100% is fine as long as it's documented in
> > the code.
> 
> Yes.  I just think it actually would be 100 % in practice, so I wondered
> whether it would need to be documented.
> 
> You’re right, though, it isn’t 100 %, so it should definitely be
> documented.  Maybe something like
> 
> In theory, we would have to keep the permissions tightened until the
> node is replaced.  In practice, that would require post-replacement
> cleanup infrastructure, which we do not have, and which would be
> unreasonably complex to implement.

Sounds good until here.

> Therefore, all we can do is require
> anyone who wants to replace one node by some potentially unrelated other
> node (i.e., the mirror job on completion) to invoke
> bdrv_recurse_can_replace() immediately before and thus minimize the time
> during which some condition may arise that might forbid the swap.
> 
> ?

This second part of your suggested comment could be dropped, as far as
I'm concerned. If anything, it's part of the contract and would belong
in the bdrv_recurse_can_replace() documentation.

However, I think I would mention why not being 100% is okay: The part
with "additional safeguard to prevent the user from doing stupid
things", and that it doesn't make a difference if the user runs the
correct command.
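
Putting both parts together, the comment could read roughly like this.
This is only a sketch of the wording: the first paragraph is yours, the
second is my attempt at the "why this is okay" part, so adjust as you
see fit:

```c
/*
 * In theory, we would have to keep the permissions tightened until
 * the node is replaced.  In practice, that would require
 * post-replacement cleanup infrastructure, which we do not have, and
 * which would be unreasonably complex to implement.
 *
 * Not being 100% reliable is acceptable here because this check is
 * only an additional safeguard to prevent the user from doing stupid
 * things.  It does not make a difference if the user runs the
 * correct command.
 */
```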

Kevin