From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: Kevin Wolf <kwolf@redhat.com>,
	Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org,
	s.reiter@proxmox.com, armbru@redhat.com, mreitz@redhat.com,
	stefanha@redhat.com, Thomas Lamprecht <t.lamprecht@proxmox.com>
Subject: Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
Date: Tue, 19 May 2020 18:48:42 +0300	[thread overview]
Message-ID: <b10dddf4-8889-c9e2-74db-d620e4fa5ca7@virtuozzo.com> (raw)
In-Reply-To: <20200519152959.GP7652@linux.fritz.box>

19.05.2020 18:29, Kevin Wolf wrote:
> Am 19.05.2020 um 17:05 hat Denis Plotnikov geschrieben:
>> On 19.05.2020 17:18, Kevin Wolf wrote:
>>> Am 19.05.2020 um 15:54 hat Denis Plotnikov geschrieben:
>>>>
>>>> On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 14.05.2020 17:26, Kevin Wolf wrote:
>>>>>> Am 14.05.2020 um 15:21 hat Thomas Lamprecht geschrieben:
>>>>>>> On 5/12/20 4:43 PM, Kevin Wolf wrote:
>>>>>>>> Stefan (Reiter), after looking a bit closer at this, I think there is
>>>>>>>> no bug in QEMU, but the bug is in your coroutine code that calls block
>>>>>>>> layer functions without moving into the right AioContext first. I've
>>>>>>>> written this series anyway as it potentially makes the life of callers
>>>>>>>> easier and would probably make your buggy code correct.
>>>>>>>> However, it doesn't feel right to commit something like patch 2
>>>>>>>> without having a user for it. Is there a reason why you can't upstream
>>>>>>>> your async snapshot code?
>>>>>>> I mean I understand what you mean, but it would IMO make the interface
>>>>>>> so much easier to use; if one wants to schedule it explicitly
>>>>>>> beforehand, they can still do so. But that would open the way for two
>>>>>>> styles of doing things, not sure if that would be seen as bad. The
>>>>>>> assert from patch 3/3 would already help a lot, though.
>>>>>> I think patches 1 and 3 are good to be committed either way if people
>>>>>> think they are useful. They make sense without the async snapshot code.
>>>>>>
>>>>>> My concern with the interface in patch 2 is both that it could give
>>>>>> people a false sense of security and that it would be tempting to write
>>>>>> inefficient code.
>>>>>>
>>>>>> Usually, you won't have just a single call into the block layer for a
>>>>>> given block node, but you'll perform multiple operations. Switching to
>>>>>> the target context once rather than switching back and forth in every
>>>>>> operation is obviously more efficient.
>>>>>>
>>>>>> But chances are that even if one of these functions is bdrv_flush(),
>>>>>> which now works correctly from a different thread, you might need
>>>>>> another function that doesn't implement the same magic. So you always
>>>>>> need to be aware of which functions support cross-context calls and
>>>>>> which ones don't.
>>>>>>
>>>>>> I feel we'd see a few bugs related to this.
>>>>>>
>>>>>>> Regarding upstreaming, there was a historical attempt to upstream it
>>>>>>> from Dietmar, but that was roughly 8 to 10 years ago.
>>>>>>> I'm not quite sure why it didn't go through then; I'll see if I can
>>>>>>> find some time to search the mailing list archive.
>>>>>>>
>>>>>>> We'd naturally be open and glad to upstream it; what it effectively
>>>>>>> allows us to do is not block the VM too much while snapshotting it
>>>>>>> live.
>>>>>> Yes, there is no doubt that this is useful functionality. There has been
>>>>>> talk about this every now and then, but I don't think we ever got to a
>>>>>> point where it actually could be implemented.
>>>>>>
>>>>>> Vladimir, I seem to remember you (or someone else from your team?) were
>>>>>> interested in async snapshots as well a while ago?
>>>>> Den is working on this (add him to CC)
>>>> Yes, I was working on that.
>>>>
>>>> What I've done can be found here:
>>>> https://github.com/denis-plotnikov/qemu/commits/bgs_uffd
>>>>
>>>> The idea was to save a snapshot (state+ram) asynchronously to a separate
>>>> (raw) file using the existing infrastructure.
>>>> The goal of that was to reduce the VM downtime on snapshot.
>>>>
>>>> We decided to postpone this work until "userfaultfd write protected
>>>> mode" was merged into the Linux mainline.
>>>> Now that userfaultfd-wp is merged into Linux, we have plans to continue
>>>> this work.
>>>>
>>>> Regarding saving the "internal" snapshot to qcow2, I still have a
>>>> question. Maybe this is the right place and time to ask.
>>>>
>>>> If I remember correctly, in qcow2 the snapshot is stored at the end of
>>>> the address space of the current block-placement-table.
>>> Yes. Basically the way snapshots with VM state work is that you write
>>> the VM state to some offset after the end of the virtual disk, when the
>>> VM state is completely written you snapshot the current state (including
>>> both content of the virtual disk and VM state) and finally discard the
>>> VM state again in the active L1 table.
>>>
>>>> We switch to the new block-placement-table after the snapshot storing
>>>> is complete. In the case of a sync snapshot, we should switch the table
>>>> before the snapshot is written; in other words, we should be able to
>>>> preallocate the space for the snapshot and keep a link to that space
>>>> until snapshot writing is completed.
>>> I don't see a fundamental difference between sync and async in this
>>> respect. Why can't you write the VM state to the current L1 table first
>>> like we usually do?
>>
>> I'm not quite sure I understand the point.
>> Let's look at the whole picture of the async snapshot: our goal is to
>> minimize the VM downtime during the snapshot.
>> When we do an async snapshot, we save the vmstate except RAM while the VM
>> is stopped, using the current L1 table (hereafter the initial L1 table).
>> Then we want the VM to start running again and to write the RAM content.
>> At this point all RAM is write-protected.
>> We unprotect each RAM page once it has been written.
> 
> Oh, I see, you're basically doing something like postcopy migration. I
> was assuming it was more like regular live migration, except that you
> would overwrite updated RAM blocks instead of appending them.
> 
> I can see your requirement then.
> 
>> All those RAM pages should go to the snapshot using the initial L1 table.
>> Since the VM is running, it may want to write new disk blocks, so we need
>> to use a NEW L1 table to provide this ability. (Am I correct so far?)
>> Thus, if I understand correctly, we need to use two L1 tables: the
>> initial one to store RAM pages in the vmstate and the new one to allow
>> disk writes.
>>
>> Maybe I just can't see a better way to achieve that. Please correct me if
>> I'm wrong.
> 
> I guess I could imagine a different, though probably not better way: We
> could internally have a separate low-level operation that moves the VM
> state from the active layer to an already existing disk snapshot. Then
> you would snapshot the disk and start writing the VM to the active
> layer, and when the VM state write has completed you move it into the
> snapshot.
> 
> The other option is doing what you suggested. There is nothing in the
> qcow2 on-disk format that would prevent this, but we would have to
> extend the qcow2 driver to allow I/O to inactive L1 tables. This sounds
> like a non-trivial amount of code changes, though it would potentially
> enable more use cases we never implemented ((read-only) access to
> internal snapshots as block nodes, so you could e.g. use block jobs to
> export a snapshot).

Or export a snapshot through NBD.

Still, I have one more idea; perhaps we have already discussed it?
Honestly, I don't like the fact that we store the vmstate in guest-data space. It's after EOF and invisible to the guest, but still...

Maybe it would be good to make a qcow2 extension for storing the vmstate separately? Then the snapshot metadata would include two more fields: vmstate_offset and vmstate_length (hmm, actually we already have the second one), which would be allocated as normal qcow2 metadata. Or we could add one or two levels of layered allocation if needed, but keep it separate from the L1/L2 tables for guest clusters.


-- 
Best regards,
Vladimir

