From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: Kevin Wolf <kwolf@redhat.com>,
	Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org,
	s.reiter@proxmox.com, armbru@redhat.com, mreitz@redhat.com,
	stefanha@redhat.com, Thomas Lamprecht <t.lamprecht@proxmox.com>
Subject: Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
Date: Tue, 19 May 2020 18:48:42 +0300	[thread overview]
Message-ID: <b10dddf4-8889-c9e2-74db-d620e4fa5ca7@virtuozzo.com> (raw)
In-Reply-To: <20200519152959.GP7652@linux.fritz.box>

19.05.2020 18:29, Kevin Wolf wrote:
> Am 19.05.2020 um 17:05 hat Denis Plotnikov geschrieben:
>> On 19.05.2020 17:18, Kevin Wolf wrote:
>>> Am 19.05.2020 um 15:54 hat Denis Plotnikov geschrieben:
>>>>
>>>> On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 14.05.2020 17:26, Kevin Wolf wrote:
>>>>>> Am 14.05.2020 um 15:21 hat Thomas Lamprecht geschrieben:
>>>>>>> On 5/12/20 4:43 PM, Kevin Wolf wrote:
>>>>>>>> Stefan (Reiter), after looking a bit closer at this, I think there is
>>>>>>>> no bug in QEMU, but the bug is in your coroutine code that calls block
>>>>>>>> layer functions without moving into the right AioContext first. I've
>>>>>>>> written this series anyway as it potentially makes the life of callers
>>>>>>>> easier and would probably make your buggy code correct.
>>>>>>>> However, it doesn't feel right to commit something like patch 2
>>>>>>>> without having a user for it. Is there a reason why you can't upstream
>>>>>>>> your async snapshot code?
>>>>>>> I mean I understand what you mean, but it would IMO make the interface
>>>>>>> so much easier to use; if one wants to schedule it explicitly
>>>>>>> beforehand, they can still do so. But that would open the way for two
>>>>>>> styles of doing things, not sure if that would be seen as bad. The
>>>>>>> assert from patch 3/3 would already help a lot, though.
>>>>>> I think patches 1 and 3 are good to be committed either way if people
>>>>>> think they are useful. They make sense without the async snapshot code.
>>>>>>
>>>>>> My concern with the interface in patch 2 is both that it could give
>>>>>> people a false sense of security and that it would be tempting to write
>>>>>> inefficient code.
>>>>>>
>>>>>> Usually, you won't have just a single call into the block layer for a
>>>>>> given block node, but you'll perform multiple operations. Switching to
>>>>>> the target context once rather than switching back and forth in every
>>>>>> operation is obviously more efficient.
>>>>>>
>>>>>> But chances are that even if one of these functions is bdrv_flush(),
>>>>>> which now works correctly from a different thread, you might need
>>>>>> another function that doesn't implement the same magic. So you always
>>>>>> need to be aware of which functions support cross-context calls and
>>>>>> which ones don't.
>>>>>>
>>>>>> I feel we'd see a few bugs related to this.
>>>>>>
>>>>>>> Regarding upstreaming, there was a historical attempt to upstream it
>>>>>>> from Dietmar, but that was roughly 8 to 10 years ago.
>>>>>>> I'm not quite sure why it didn't go through then; I'll see if I can
>>>>>>> find some time to search the mailing list archive.
>>>>>>>
>>>>>>> We'd naturally be open and glad to upstream it; what it effectively
>>>>>>> allows us to do is not block the VM too much while snapshotting it
>>>>>>> live.
>>>>>> Yes, there is no doubt that this is useful functionality. There has been
>>>>>> talk about this every now and then, but I don't think we ever got to a
>>>>>> point where it actually could be implemented.
>>>>>>
>>>>>> Vladimir, I seem to remember you (or someone else from your team?) were
>>>>>> interested in async snapshots as well a while ago?
>>>>> Den is working on this (add him to CC)
>>>> Yes, I was working on that.
>>>>
>>>> What I've done can be found here:
>>>> https://github.com/denis-plotnikov/qemu/commits/bgs_uffd
>>>>
>>>> The idea was to save a snapshot (state+ram) asynchronously to a separate
>>>> (raw) file using the existing infrastructure.
>>>> The goal of that was to reduce the VM downtime on snapshot.
>>>>
>>>> We decided to postpone this work until "userfaultfd write protected
>>>> mode" was merged into the Linux mainline.
>>>> Now that userfaultfd-wp is merged into Linux, we have plans to continue
>>>> this work.
>>>>
>>>> Regarding saving the "internal" snapshot to qcow2, I still have a
>>>> question. Maybe this is the right place and time to ask.
>>>>
>>>> If I remember correctly, in qcow2 the snapshot is stored at the end of
>>>> the address space of the current block-placement-table.
>>> Yes. Basically the way snapshots with VM state work is that you write
>>> the VM state to some offset after the end of the virtual disk, when the
>>> VM state is completely written you snapshot the current state (including
>>> both content of the virtual disk and VM state) and finally discard the
>>> VM state again in the active L1 table.
>>>
>>>> We switch to the new block-placement-table after the snapshot storing
>>>> is complete. In the case of a sync snapshot, we should switch the table
>>>> before the snapshot is written; in other words, we should be able to
>>>> preallocate the space for the snapshot and keep a link to that space
>>>> until snapshot writing is completed.
>>> I don't see a fundamental difference between sync and async in this
>>> respect. Why can't you write the VM state to the current L1 table first
>>> like we usually do?
>>
>> I'm not quite sure I understand the point.
>> Let's look at the whole picture of the async snapshot: our goal is to
>> minimize the VM downtime during the snapshot.
>> When we do an async snapshot, we save the vmstate except RAM while the VM
>> is stopped, using the current L1 table (hereafter the initial L1 table).
>> Then we want the VM to start running again and to write the RAM content.
>> At this point all RAM is write-protected.
>> We unprotect each RAM page once it has been written.
> 
> Oh, I see, you're basically doing something like postcopy migration. I
> was assuming it was more like regular live migration, except that you
> would overwrite updated RAM blocks instead of appending them.
> 
> I can see your requirement then.
> 
>> All those RAM pages should go to the snapshot using the initial L1 table.
>> Since the VM is running, it may want to write new disk blocks, so we need
>> to use a NEW L1 table to provide this ability. (Am I correct so far?)
>> Thus, if I understand correctly, we need to use two L1 tables: the
>> initial one to store RAM pages in the vmstate and the new one to allow
>> disk writes.
>>
>> Maybe I just can't see a better way to achieve that. Please correct me if
>> I'm wrong.
> 
> I guess I could imagine a different, though probably not better way: We
> could internally have a separate low-level operation that moves the VM
> state from the active layer to an already existing disk snapshot. Then
> you would snapshot the disk and start writing the VM to the active
> layer, and when the VM state write has completed you move it into the
> snapshot.
> 
> The other option is doing what you suggested. There is nothing in the
> qcow2 on-disk format that would prevent this, but we would have to
> extend the qcow2 driver to allow I/O to inactive L1 tables. This sounds
> like a non-trivial amount of code changes, though it would potentially
> enable more use cases we never implemented ((read-only) access to
> internal snapshots as block nodes, so you could e.g. use block jobs to
> export a snapshot).

Or export a snapshot through NBD.

Still, I have one more idea; perhaps we have already discussed it?
Honestly, I don't like the fact that we store the vmstate in guest-data space. It's after EOF and invisible to the guest, but still...

Maybe it would be good to make a qcow2 extension for storing the vmstate separately? Then the snapshot metadata would include two more fields: vmstate_offset and vmstate_length (hmm, actually we already have the second one), which would be allocated as normal qcow2 metadata. Or we could add one or two levels of layered allocation if needed, but keep it separate from the L1/L2 tables for guest clusters.


-- 
Best regards,
Vladimir

