From: Max Reitz <mreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: kwolf@redhat.com, jsnow@redhat.com, den@openvz.org
Subject: Re: [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage
Date: Tue, 3 Apr 2018 18:03:43 +0200	[thread overview]
Message-ID: <d0d202f1-6384-178c-9dd7-989d5f06c48a@redhat.com> (raw)
In-Reply-To: <7be588fb-dd63-d6fd-0e39-81fbee1bdd25@virtuozzo.com>

On 2018-03-30 17:32, Vladimir Sementsov-Ogievskiy wrote:
> 30.03.2018 16:31, Vladimir Sementsov-Ogievskiy wrote:
>> 29.03.2018 18:09, Vladimir Sementsov-Ogievskiy wrote:
>>> 29.03.2018 17:03, Max Reitz wrote:
>>>> On 2018-03-29 10:08, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 28.03.2018 17:53, Max Reitz wrote:
>>>>>> On 2018-03-27 12:11, Vladimir Sementsov-Ogievskiy wrote:
>>>> [...]
>>>>
>>>>>>> Isn't it because of a lot of cat processes? Will check; updating
>>>>>>> the loop to: i=0; while check -qcow2 169; do ((i++)); echo $i OK;
>>>>>>> killall -9 cat; done
>>>>>> Hmm...  I know I tried to kill all of the cats, but for some
>>>>>> reason that
>>>>>> didn't really help yesterday.  Seems to help now, for 2.12.0-rc0 at
>>>>>> least (that is, before this series).
>>>>> Reproduced even with the killing... (without this series, just on master)
>>>>>
>>>>>> After the whole series, I still get a lot of failures in 169
>>>>>> (mismatching bitmap hash, mostly).
>>>>>>
>>>>>> And interestingly, if I add an abort():
>>>>>>
>>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>>>> index 486f3e83b7..9204c1c0ac 100644
>>>>>> --- a/block/qcow2.c
>>>>>> +++ b/block/qcow2.c
>>>>>> @@ -1481,6 +1481,7 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
>>>>>>      }
>>>>>>
>>>>>>      if (bdrv_dirty_bitmap_next(bs, NULL)) {
>>>>>> +        abort();
>>>>>>          /* It's some kind of reopen with already existing dirty bitmaps. There
>>>>>>           * are no known cases where we need loading bitmaps in such situation,
>>>>>>           * so it's safer don't load them.
>>>>>>
>>>>>> Then this fires for a couple of test cases of 169 even without the
>>>>>> third
>>>>>> patch of this series.
>>>>>>
>>>>>> I guess bdrv_dirty_bitmap_next() reacts to some bitmaps that
>>>>>> migration
>>>>>> adds or something?  Then this would be the wrong condition, because I
>>>>>> guess we still want to load the bitmaps that are in the qcow2 file.
>>>>>>
>>>>>> I'm not sure whether bdrv_has_readonly_bitmaps() is the correct
>>>>>> condition then, either, though.  Maybe let's take a step back: We
>>>>>> want
>>>>>> to load all the bitmaps from the file exactly once, and that is
>>>>>> when it
>>>>>> is opened the first time.  Or that's what I would have thought...  Is
>>>>>> that even correct?
>>>>>>
>>>>>> Why do we load the bitmaps when the device is inactive anyway?
>>>>>> Shouldn't we load them only once the device is activated?
>>>>> Hmm, not sure. Maybe we don't need to. But we still need to load them
>>>>> when opening read-only, and we should correspondingly reopen them in
>>>>> this case.
>>>> Yeah, well, yeah, but the current state seems just wrong. Apparently
>>>> there are cases where a qcow2 node may have bitmaps before we try to
>>>> load them from the file, so the current condition doesn't work.
>>>>
>>>> Furthermore, it seems like the current "state machine" is too complex,
>>>> so we no longer know which cases are possible and what to do in each
>>>> of them.
>>>>
>>>> So the first thing we could do is add a field to the BDRVQCow2State
>>>> that
>>>> tells us whether the bitmaps have been loaded already or not. If not,
>>>> we invoke qcow2_load_dirty_bitmaps() and set the value to true. If the
>>>> value was true already and the BDS is active and R/W now, we call
>>>> qcow2_reopen_bitmaps_rw_hint().  That should solve one problem.
>>>
>>> good idea, will do.
>>>
>>>>
>>>> The other problem of course is the question whether we should call
>>>> qcow2_load_dirty_bitmaps() at all while the drive is still inactive.
>>>> You know the migration model better than me, so I'm asking this
>>>> question
>>>> to you.  We can phrase it differently: Do we need to load the bitmaps
>>>> before the drive is activated?
>>>
>>> Now I think that we don't need to. At least, we don't have such cases
>>> in Virtuozzo (I hope :).
>>>
>>> Why I had doubts:
>>>
>>> 1. We have cases when we want to start the VM as inactive, to be able
>>> to export its drive over NBD, push some data into it and then start
>>> the VM (which involves activating).
>>> 2. We have cases when we want to start the VM stopped and operate on
>>> dirty bitmaps.
>>>
>>> If we just open all images in inactive mode until the VM starts, it
>>> looks like we need bitmaps in inactive mode (for 2.). But that looks
>>> like the wrong approach anyway.
>>> At first, I tried to solve (1.) by simply calling inactivate_all() when
>>> starting the VM in paused mode, but that breaks at least (2.), so in
>>> the end I solved (1.) with an approach similar to "-incoming defer".
>>> So, we have inactive mode in two cases:
>>>  - incoming migration
>>>  - pushing data to the VM before it starts
>>>
>>> and, in these cases, we don't need to load dirty bitmaps.
>>>
>>> Also, there is an inconsistency: we currently remove persistent bitmaps
>>> on inactivation, so it is inconsistent to load them in inactive mode.
>>>
>>> Ok, I'll try to respin.
>>
>> Finally, what cases do we actually have for qcow2_do_open?
>>
>> 1. INACTIVE -> ACTIVE (through invalidate_cache, we obviously should
>> load bitmaps, if we decided that we have no persistent bitmaps in
>> INACTIVE mode)
>> 2. creating a new bdrv state (first open of the image) in INACTIVE mode
>> (will not load bitmaps)
>> 3. creating a new bdrv state (first open of the image) in ACTIVE mode
>> (will load bitmaps, maybe read-only if the disk is RO)
>>
>> If there are only these three cases, it would be enough to just load
>> bitmaps if !INACTIVE and do nothing otherwise.
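
If it really is just these three cases, then (if I understand you
correctly) that would amount to roughly the following in qcow2_do_open()
-- just a sketch on my side, not compile-tested:

/* Sketch: load persistent bitmaps only once the node is active,
 * i.e. skip loading entirely while BDRV_O_INACTIVE is set */
if (!(bdrv_get_flags(bs) & BDRV_O_INACTIVE)) {
    qcow2_load_dirty_bitmaps(bs, errp);
}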

Maybe, but I'd prefer just adding a flag for now.  We can think about
whether we need bitmaps in inactive mode later, and even then managing
the state through a flag doesn't hurt.
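
Roughly what I have in mind, as a sketch (the helper name and the
dirty_bitmaps_loaded field are made up, and I'm only guessing at the
qcow2_reopen_bitmaps_rw_hint() signature from your patch 1):

/* Hypothetical helper; would be called from qcow2_do_open() / the
 * invalidate path, with dirty_bitmaps_loaded a new bool in
 * BDRVQCow2State. */
static void qcow2_update_bitmap_state(BlockDriverState *bs, Error **errp)
{
    BDRVQCow2State *s = bs->opaque;

    if (!s->dirty_bitmaps_loaded) {
        /* First time through for this BDS: load bitmaps from the file */
        qcow2_load_dirty_bitmaps(bs, errp);
        s->dirty_bitmaps_loaded = true;
    } else if (!(bdrv_get_flags(bs) & BDRV_O_INACTIVE) &&
               !bdrv_is_read_only(bs)) {
        /* Bitmaps were loaded earlier (e.g. while inactive or read-only);
         * now that we are active and writable, make them writable too. */
        bool header_updated;

        qcow2_reopen_bitmaps_rw_hint(bs, &header_updated, errp);
    }
}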

>> Or do we have some of the following cases, too?
>>
>> 1?. ACTIVE -> ACTIVE (through invalidate_cache, some kind of no-op, we
>> should not reload bitmaps)
>> 2?. RO -> RW (we should reopen_bitmaps_rw) (or is this possible only
>> through bdrv_reopen, which will not touch qcow2_do_open?)
>> 3?. RW -> RO (reopen_bitmaps_ro ?)
>> ? anything else??

I don't really know, but I just hope having a flag saves us from having
to think about all of this too much. :-)

>>> about 169, how often is it reproducible for you?

More errors than passes, I guess.

But it always leaves cats behind, and that's another sign that something
isn't right.

>> It becomes very interesting.
>>
>> The persistent-migbitmap-online case failed on line 136 with a
>> mismatched bitmap sha. This check is not related to migration; at this
>> point the bitmap has already been migrated successfully. But for some
>> reason it is corrupted after stopping/starting the VM. I can't imagine
>> how that is possible, but it doesn't really look related to
>> migration... Could it be related to a case where postcopy was not
>> finished, or something like that?.. Maybe fixing wait(RESUME) to
>> something more appropriate will help. But it is strange anyway.
>>
>> The persistent-notmigbitmap-offline case failed on line 128, which is
>> "self.vm_b.event_wait("RESUME", timeout=10.0)", with a timeout... Can
>> the RESUME event be skipped for some reason? Maybe just moving to the
>> MIGRATION event will help; will check.

It would still be interesting to know where the RESUME ended up...

> Looks like moving from "RESUME" to "MIGRATION" events fixes the whole
> issue (I've already had 740 successful iterations, which is good enough,
> but I will not stop the loop until at least 2000). I'll submit a patch.
> But I don't understand why it fixes the sha256 mismatch, and why the
> sha256 check fails with "RESUME" (the check succeeds after migration and
> fails after stop/start, o_O)...

Well, if you have the time, it might be worth investigating.

Max

Thread overview: 18+ messages
2018-03-20 17:05 [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Vladimir Sementsov-Ogievskiy
2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 1/3] qcow2-bitmap: add qcow2_reopen_bitmaps_rw_hint() Vladimir Sementsov-Ogievskiy
2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 2/3] qcow2: fix bitmaps loading when bitmaps already exist Vladimir Sementsov-Ogievskiy
2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 2/3] qcow2: handle reopening bitmaps on bdrv_invalidate_cache Vladimir Sementsov-Ogievskiy
2018-03-20 17:07   ` [Qemu-devel] [PATCH v4 2/3] qcow2: handle reopening bitmaps on bdrv_invalidate_cache DROP IT Vladimir Sementsov-Ogievskiy
2018-03-20 17:05 ` [Qemu-devel] [PATCH v4 3/3] iotests: enable shared migration cases in 169 Vladimir Sementsov-Ogievskiy
2018-03-21 13:20 ` [Qemu-devel] [PATCH v4 for 2.12 0/3] fix bitmaps migration through shared storage Max Reitz
2018-03-26 18:06 ` Max Reitz
2018-03-27  9:28   ` Vladimir Sementsov-Ogievskiy
2018-03-27  9:53     ` Vladimir Sementsov-Ogievskiy
2018-03-27 10:11       ` Vladimir Sementsov-Ogievskiy
2018-03-28 14:53         ` Max Reitz
2018-03-29  8:08           ` Vladimir Sementsov-Ogievskiy
2018-03-29 14:03             ` Max Reitz
2018-03-29 15:09               ` Vladimir Sementsov-Ogievskiy
2018-03-30 13:31                 ` Vladimir Sementsov-Ogievskiy
2018-03-30 15:32                   ` Vladimir Sementsov-Ogievskiy
2018-04-03 16:03                     ` Max Reitz [this message]
