From: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org, Den Lunev <den@openvz.org>,
	Eric Blake <eblake@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Juan Quintela <quintela@redhat.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	Markus Armbruster <armbru@redhat.com>
Subject: Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
Date: Tue, 24 Nov 2020 20:40:00 +0300
Message-ID: <71c30a1e-8fbd-1eb3-aa5f-2efc973c0647@virtuozzo.com>
In-Reply-To: <20201124151729.GA257984@xz-x1>

On 24.11.2020 18:17, Peter Xu wrote:
> On Tue, Nov 24, 2020 at 11:02:09AM +0300, Andrey Gruzdev wrote:
>> On 24.11.2020 00:34, Peter Xu wrote:
>>> On Fri, Nov 20, 2020 at 07:53:34PM +0300, Andrey Gruzdev wrote:
>>>> On 20.11.2020 19:43, Peter Xu wrote:
>>>>> On Fri, Nov 20, 2020 at 07:15:07PM +0300, Andrey Gruzdev wrote:
>>>>>> Yeah, I think we can re-use the postcopy queue code for faulting pages. I'm
>>>>>> worrying a little about the additional overhead of dealing with the
>>>>>> urgent-request semaphore. Also, the code won't change a lot; something like:
>>>>>>
>>>>>> [...]
>>>>>>            /* In case of 'write-tracking' migration we first try
>>>>>>             * to poll UFFD and see if we have a write page fault event */
>>>>>>            poll_fault_page(rs);
>>>>>>
>>>>>>            again = true;
>>>>>>            found = get_queued_page(rs, &pss);
>>>>>>
>>>>>>            if (!found) {
>>>>>>                /* priority queue empty, so just search for something dirty */
>>>>>>                found = find_dirty_block(rs, &pss, &again);
>>>>>>            }
>>>>>> [...]
>>>>>
>>>>> Could I ask what's the "urgent request semaphore"?  Thanks,
>>>>>
>>>>
>>>> These functions use it (the correct name is 'rate_limit_sem'):
>>>>
>>>> void migration_make_urgent_request(void)
>>>> {
>>>>       qemu_sem_post(&migrate_get_current()->rate_limit_sem);
>>>> }
>>>>
>>>> void migration_consume_urgent_request(void)
>>>> {
>>>>       qemu_sem_wait(&migrate_get_current()->rate_limit_sem);
>>>> }
>>>>
>>>> They are called from ram_save_queue_pages() and unqueue_page(), respectively,
>>>> to control the migration rate limiter.
>>>>
>>>> bool migration_rate_limit(void)
>>>> {
>>>> [...]
>>>>           /*
>>>>            * Wait for a delay to do rate limiting OR
>>>>            * something urgent to post the semaphore.
>>>>            */
>>>>           int ms = s->iteration_start_time + BUFFER_DELAY - now;
>>>>           trace_migration_rate_limit_pre(ms);
>>>>           if (qemu_sem_timedwait(&s->rate_limit_sem, ms) == 0) {
>>>>               /*
>>>>                * We were woken by one or more urgent things but
>>>>                * the timedwait will have consumed one of them.
>>>>                * The service routine for the urgent wake will dec
>>>>                * the semaphore itself for each item it consumes,
>>>>                * so add this one we just eat back.
>>>>                */
>>>>               qemu_sem_post(&s->rate_limit_sem);
>>>>               urgent = true;
>>>>           }
>>>> [...]
>>>> }
>>>>
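>>>> To illustrate the pairing outside of QEMU, here is a minimal standalone
>>>> sketch with plain POSIX semaphores (not the qemu_sem_* wrappers): a
>>>> successful timedwait consumes one posted token, so it has to be given
>>>> back for the consumer in unqueue_page() to balance:
>>>>
>>>> #include <semaphore.h>
>>>> #include <stdbool.h>
>>>> #include <time.h>
>>>>
>>>> /* Sleep up to @ms for rate limiting, but wake early if an urgent
>>>>  * request has been posted.  The timedwait eats one token on wakeup,
>>>>  * so we re-post it: the service routine that dequeues the urgent
>>>>  * item will consume it itself. */
>>>> static bool rate_limit_wait(sem_t *urgent, int ms)
>>>> {
>>>>     struct timespec ts;
>>>>
>>>>     clock_gettime(CLOCK_REALTIME, &ts);
>>>>     ts.tv_sec += ms / 1000;
>>>>     ts.tv_nsec += (long)(ms % 1000) * 1000000L;
>>>>     if (ts.tv_nsec >= 1000000000L) {
>>>>         ts.tv_sec++;
>>>>         ts.tv_nsec -= 1000000000L;
>>>>     }
>>>>     if (sem_timedwait(urgent, &ts) == 0) {
>>>>         sem_post(urgent);      /* give the consumed token back */
>>>>         return true;           /* urgent work is pending */
>>>>     }
>>>>     return false;              /* plain timeout, no urgent work */
>>>> }
>>>>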
>>>
>>> Hmm... why could its overhead be a problem?  If it's an overhead that can be
>>> avoided, then postcopy might want that too.
>>>
>>> The thing is I really feel like the snapshot logic can leverage the whole idea
>>> of existing postcopy (like get_queued_page/unqueue_page; it's probably due to
>>> the fact that both of them want to "migrate some more urgent pages than the
>>> background migration, due to either missing/wrprotected pages"), but I might
>>> be missing something.
>>>
>>> Thanks,
>>>
>>
>> I don't think this overhead itself is a problem, since it's negligible
>> compared to other stuff. You're undoubtedly correct about using the postcopy
>> idea in the case when wr-fault pages arrive from a separate thread or
>> external source. Then we really need to decouple that separate thread or
>> external requestor from the migration thread.
>>
>> In this patch series wr-faults arrive in the same migration loop as the
>> normal scan, so I don't see a direct reason to pass them through the queue,
>> but I might have missed something from your proposal.
> 
> I see your point.  For me, whether to use a thread is not extremely important;
> actually we could have a standalone thread handle the vcpu faults too, just
> like postcopy, only run on the src host.  My major point is that we should
> avoid introducing the extra pss logic, because fundamentally it's doing the
> same thing as get_queued_page() right now, unless I'm wrong...
> 
> So I think your previous poll_fault_page() call is okay; assuming the same poll
> model as you used, I mean something like this:
> 
> ------8<-------
> diff --git a/migration/ram.c b/migration/ram.c
> index 7811cde643..718dd276c7 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1473,6 +1473,14 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>   
>       } while (block && !dirty);
>   
> +    if (!block) {
> +        /*
> +         * Set the block too if live snapshot is enabled; that's when we
> +         * have vcpus blocked by the wr-protected pages.
> +         */
> +        block = poll_fault_page(rs, &offset);
> +    }
> +
>       if (block) {
>           /*
>            * As soon as we start servicing pages out of order, then we have
> ------8<-------
> 
> So as long as we set the block and offset, pss will be updated just like
> postcopy.  That's exactly what we want, am I right?
> 

I also think this is the best place to put the polling call, so we don't touch
pss yet again.
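
For reference, here is a rough sketch of how such a poll_fault_page() could
look on my side. The names are assumptions, not the actual patch code: the
uffd file descriptor field in RAMState and the lookup via
qemu_ram_block_from_host() are just how I'd sketch it.

#include <poll.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Non-blocking check for a pending write-protect fault.  Returns the
 * RAMBlock containing the faulting page and fills @offset, or NULL if
 * no event is queued on the userfaultfd. */
static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
{
    struct pollfd pfd = { .fd = rs->uffd_fd, .events = POLLIN };
    struct uffd_msg msg;

    if (poll(&pfd, 1, 0) <= 0) {
        return NULL;                      /* nothing pending */
    }
    if (read(rs->uffd_fd, &msg, sizeof(msg)) != sizeof(msg) ||
        msg.event != UFFD_EVENT_PAGEFAULT) {
        return NULL;
    }
    /* Map the faulting host address back to a guest RAMBlock. */
    return qemu_ram_block_from_host(
               (void *)(uintptr_t)msg.arg.pagefault.address,
               false, offset);
}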

>>
>> Do you mean that if we use the postcopy logic, we'll allocate memory, make
>> copies of the faulting pages and then immediately un-protect them?
>> Or do we just put the faulting page address into the queued item and release
>> protection after the page content has been saved?
> 
> I think the current approach would be fine: we don't copy the page, and we
> un-protect in a single place after the page is dumped to the precopy stream.
> If you could measure the latencies, that would be even better; then we'll
> know what to expect.
> 
> IIRC the "copy page first" idea came from Denis's initial version, which I
> thought was too aggressive, since potentially it can eat twice the memory on
> the host for a single guest, not to mention when a live snapshot is taken of
> multiple guests on the same host.  It's just not something the user would
> expect when taking a snapshot.  That's something we can do on top of this
> work, imho, if we see very high latency on resolving the page faults.  But
> that's still the last thing to try (or at least a "by default off" option),
> so we can still think about some other solution before trying to eat too
> much host mem.
> 
> Thanks,
> 

I totally agree that creating shadow copies for each faulting page is not a
very robust idea, since we can never know the 'proper' limit for the buffer
memory we should allocate and carve out of the host; it depends on hardware,
guest workload, host I/O intensity, too many things.

So I believe the file/block backend side is a better place to provide some
buffering and to manage the memory/latency tradeoff. It also keeps the change
less 'invasive' to QEMU code itself.
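
To make the agreed ordering concrete, here's a sketch of the un-protect step
that would run right after a faulting page has been put on the stream. It
uses the plain UFFDIO_WRITEPROTECT ioctl; the helper name and signature are
mine for illustration, not the patch code.

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <stdint.h>

/* Drop write protection from a single page once its contents are
 * already saved into the precopy stream, so a blocked vCPU can
 * resume.  Error handling trimmed for brevity. */
static int uffd_unprotect_page(int uffd, void *host_addr, uint64_t page_size)
{
    struct uffdio_writeprotect wp = {
        .range = {
            .start = (uint64_t)(uintptr_t)host_addr,
            .len   = page_size,
        },
        .mode = 0,    /* no _MODE_WP flag: clears the protection */
    };
    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}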

Thanks,

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                virtuozzo.com

