Re: [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate()

From: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org, Den Lunev <den@openvz.org>,
	Eric Blake <eblake@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Juan Quintela <quintela@redhat.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	Markus Armbruster <armbru@redhat.com>
Subject: Re: [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate()
Date: Mon, 30 Nov 2020 22:27:56 +0300	[thread overview]
Message-ID: <2e89b0f9-d6c1-8fb2-4dfb-0eaf85e0bdbf@virtuozzo.com> (raw)
In-Reply-To: <20201130163221.GB3277@xz-x1>

On 30.11.2020 19:32, Peter Xu wrote:
> On Mon, Nov 30, 2020 at 12:14:03AM +0300, Andrey Gruzdev wrote:
>>>> +#ifdef CONFIG_LINUX
>>>> +/**
>>>> + * ram_find_block_by_host_address: find RAM block containing host page
>>>> + *
>>>> + * Returns pointer to RAMBlock if found, NULL otherwise
>>>> + *
>>>> + * @rs: current RAM state
>>>> + * @page_address: host page address
>>>> + */
>>>> +static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr page_address)
>>>
>>> Reuse qemu_ram_block_from_host() somehow?
>>>
>>
>> Seems not very suitable here, since we use rs->last_seen_block to restart
>> search..
> 
> The search logic is actually still the same, it's just about which block to
> start searching (rs->last_seen_block, ram_list.mru_block, or the 1st one).  So
> an internal helper to pass in that information would be nice.  Though that'll
> require rcu lock taken before hand to keep the ramblock ptr valid.
> 
> No worry - we can keep this and work on top too, I think.
> 
> [...]
>  >>>> +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
>>>> +{
>>>> +    struct uffd_msg uffd_msg;
>>>> +    hwaddr page_address;
>>>> +    RAMBlock *bs;
>>>> +    int res;
>>>> +
>>>> +    if (!migrate_background_snapshot()) {
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
>>>> +    if (res <= 0) {
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    page_address = uffd_msg.arg.pagefault.address;
>>>> +    bs = ram_find_block_by_host_address(rs, page_address);
>>>> +    if (!bs) {
>>>> +        /* In case we couldn't find respective block, just unprotect faulting page. */
>>>> +        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
>>>> +        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
>>>> +                page_address);
>>>
>>> Looks ok to error_report() instead of assert(), but I'll suggest drop the call
>>> to uffd_protect_memory() at least.  The only reason to not use assert() is
>>> because we try our best to avoid crashing the vm, however I really doubt
>>> whether uffd_protect_memory() is the right thing to do even if it happens - we
>>> may at last try to unprotect some strange pages that we don't even know where
>>> it is...
>>>
>>
>> IMHO better to unprotect these strange pages then to leave them protected by
>> UFFD.. To avoid getting VM completely in-operational.
>> At least we know the page generated wr-fault, maybe due to incorrect
>> write-tracking initialization, or RAMBlock somehow has gone. Nevertheless if
>> leave the page as is, VM would certainly lock.
> 
> Yes makes some senes too.  However it's definitely a severe programming error,
> even if the VM is unblocked, it can be in a very weird state...
> 
> Maybe we just assert? Then we can see how unlucky we'll be. :)
> 

Yeah, seems assert is a right thing here :)

>>
>> Hmm, I wonder about assert(). In QEMU it would do something in release
>> builds?
> 
> I guess yes, at least for now.  Because osdep.h has this:
> 
> /*
>   * We have a lot of unaudited code that may fail in strange ways, or
>   * even be a security risk during migration, if you disable assertions
>   * at compile-time.  You may comment out these safety checks if you
>   * absolutely want to disable assertion overhead, but it is not
>   * supported upstream so the risk is all yours.  Meanwhile, please
>   * submit patches to remove any side-effects inside an assertion, or
>   * fixing error handling that should use Error instead of assert.
>   */
> #ifdef NDEBUG
> #error building with NDEBUG is not supported
> #endif
> #ifdef G_DISABLE_ASSERT
> #error building with G_DISABLE_ASSERT is not supported
> #endif
> 
> [...]
> 

Ooops, didn't know that, thanks for info!

>>>> +/**
>>>> + * ram_save_host_page_post: ram_save_host_page() post-notifier
>>>> + *
>>>> + * @rs: current RAM state
>>>> + * @pss: page-search-status structure
>>>> + * @opaque: opaque context value
>>>> + * @res_override: pointer to the return value of ram_save_host_page(),
>>>> + *   overwritten in case of an error
>>>> + */
>>>> +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
>>>> +        void *opaque, int *res_override)
>>>> +{
>>>> +    /* Check if page is from UFFD-managed region. */
>>>> +    if (pss->block->flags & RAM_UF_WRITEPROTECT) {
>>>> +#ifdef CONFIG_LINUX
>>>> +        ram_addr_t page_from = (ram_addr_t) opaque;
>>>> +        hwaddr page_address = (hwaddr) pss->block->host +
>>>> +                              ((hwaddr) page_from << TARGET_PAGE_BITS);
>>>
>>> I feel like most new uses of hwaddr is not correct...  As I also commented in
>>> the other patch.  We should save a lot of castings if switched.
>>>
>>
>> Need to check. hwaddr is typedef'ed as uint64_t while ram_addr_t as
>> uintptr_t. I any case UFFD interface relies on u64 type.
> 
> For example, page_from should be a host address, we can use unsigned long, or
> uint64_t, but ram_addr_t is not proper, which is only used in ramblock address
> space.
> 
> I think it's fine that e.g. we use common types like uint64_t directly.
> 

Now I also think we'd better use common types - mostly uint64_t 
directly, not to confuse anybody with that specific type descriptions.

>>
>>>> +        hwaddr run_length = (hwaddr) (pss->page - page_from + 1) << TARGET_PAGE_BITS;
>>>> +        int res;
>>>> +
>>>> +        /* Flush async buffers before un-protect. */
>>>> +        qemu_fflush(rs->f);
>>>> +        /* Un-protect memory range. */
>>>> +        res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
>>>> +        /* We don't want to override existing error from ram_save_host_page(). */
>>>> +        if (res < 0 && *res_override >= 0) {
>>>> +            *res_override = res;
>>>
>>> What is this used for?  If res<0, I thought we should just fail the snapshot.
>>>
>>> Meanwhile, res_override points to "pages", and then it'll be rewrite to the
>>> errno returned by uffd_protect_memory().  Smells strange.
>>>
>>> Can this ever be triggered anyway?
>>>
>>
>> Yes, since "pages" is also for error return, if negative. If we have a
>> problem with un-protecting, promote the error to the loop in
>> ram_find_and_save_block() so it exits early ("pages" guaranteed to be
>> non-zero). In outer routines retcode would go to qemu_set_file_error().
> 
> Ok I see your point.
> 
>>
>>>> +        }
>>>> +#else
>>>> +        /* Should never happen */
>>>> +        qemu_file_set_error(rs->f, -ENOSYS);
>>>> +#endif /* CONFIG_LINUX */
>>>> +    }
>>>> +}
>>>> +
>>>>    /**
>>>>     * ram_find_and_save_block: finds a dirty page and sends it to f
>>>>     *
>>>> @@ -1779,14 +1908,14 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>>>            return pages;
>>>>        }
>>>> +    if (!rs->last_seen_block) {
>>>> +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
>>>
>>> Why setup the last seen block to be the first if null?
>>>
>>
>> Because ram_find_block_by_host_address() relies on rs->last_seen_block.
>> Pss is not passed to that routine.
> 
> Ok, but it's odd - because that's not the real "last seen block".
> 
> So now I'm rethinking maybe we could still reuse qemu_ram_block_from_host() by
> providing a new helper to do the search logic as mentioned above, which could
> take a pointer of RAMBlock as the 1st ramblock to search.  Then
> ram_find_block_by_host_address() should pass in rs->last_seen_block (which
> could be NULL - then we'll start with the 1st ramblock), or ram_list.mru_block
> for qemu_ram_block_from_host().  That could be a standalone new patch, then we
> don't need this slightly strange change.  What do you think?
> 

Mm, need to think about that..

>>
>>>> +    }
>>>> +
>>>>        pss.block = rs->last_seen_block;
>>>>        pss.page = rs->last_page;
>>>>        pss.complete_round = false;
>>>> -    if (!pss.block) {
>>>> -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
>>>> -    }
>>>> -
>>>>        do {
>>>>            again = true;
>>>>            found = get_queued_page(rs, &pss);
>>>> @@ -1797,7 +1926,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>>>            }
>>>>            if (found) {
>>>> +            void *opaque;
>>>> +
>>>> +            ram_save_host_page_pre(rs, &pss, &opaque);
>>>>                pages = ram_save_host_page(rs, &pss, last_stage);
>>>> +            ram_save_host_page_post(rs, &pss, opaque, &pages);
>>>
>>> So the pre/post idea is kind of an overkill to me...
>>>
>>> How about we do the unprotect in ram_save_host_page() in the simple way, like:
>>>
>>>     ram_save_host_page()
>>>       start_addr = pss->page;
>>>       do {
>>>         ...
>>>         (migrate pages)
>>>         ...
>>>       } while (...);
>>>       if (background_snapshot_enabled()) {
>>>         unprotect pages within start_addr to pss->page;
>>>       }
>>>       ...
>>>
>>
>> Personally I prefer not to modify existing code. May be adding to simple
>> calls would finally look cleaner?
> 
> Let's wait a 3rd opinion from Juan/Dave or others.
> 
> Thanks,
> 

Agree.

Thanks,

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com