Re: [PATCH V4 15/24] xen/arm: Stick around in leave_hypervisor_to_guest until I/O has completed

From: Oleksandr <olekstysh@gmail.com>
To: Julien Grall <julien@xen.org>
Cc: xen-devel@lists.xenproject.org,
	Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>,
	Julien Grall <julien.grall@arm.com>
Subject: Re: [PATCH V4 15/24] xen/arm: Stick around in leave_hypervisor_to_guest until I/O has completed
Date: Sun, 17 Jan 2021 22:23:21 +0200	[thread overview]
Message-ID: <e57ca0f8-f92f-1ba3-d874-4265e1031542@gmail.com> (raw)
In-Reply-To: <1f1f910b-ebef-f071-3458-12ad493d6e79@xen.org>

On 15.01.21 22:55, Julien Grall wrote:
> Hi Oleksandr,

Hi Julien

>
> On 12/01/2021 21:52, Oleksandr Tyshchenko wrote:
>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>
>> This patch adds proper handling of return value of
>> vcpu_ioreq_handle_completion() which involves using a loop in
>> leave_hypervisor_to_guest().
>>
>> The reason to use an unbounded loop here is the fact that vCPU shouldn't
>> continue until the I/O has completed.
>>
>> The IOREQ code is using wait_on_xen_event_channel(). Yet, this can
>> still "exit" early if an event has been received. But this doesn't mean
>> the I/O has completed (in can be just a spurious wake-up).
>
> While I agree we need the loop, I don't think the reason is correct 
> here. If you receive a spurious event, then the loop in wait_for_io() 
> will catch it.
>
> The only way to get out of that loop is if the I/O has been handled or 
> the state in the IOREQ page is invalid.
>
> In addition to that, handle_hvm_io_completion(), will only return 
> false if the state is invalid or there is vCPI work to do.

Agree, update description.

>
>
>> So we need
>> to check if the I/O has completed and wait again if it hasn't (we will
>> block the vCPU again until an event is received). This loop makes sure
>> that all the vCPU works are done before we return to the guest.
>>
>> The call chain below:
>> check_for_vcpu_work -> vcpu_ioreq_handle_completion -> wait_for_io ->
>> wait_on_xen_event_channel
>>
>> The worse that can happen here if the vCPU will never run again
>> (the I/O will never complete). But, in Xen case, if the I/O never
>> completes then it most likely means that something went horribly
>> wrong with the Device Emulator. And it is most likely not safe
>> to continue. So letting the vCPU to spin forever if the I/O never
>> completes is a safer action than letting it continue and leaving
>> the guest in unclear state and is the best what we can do for now.
>>
>> Please note, using this loop we will not spin forever on a pCPU,
>> preventing any other vCPUs from being scheduled. At every loop
>> we will call check_for_pcpu_work() that will process pending
>> softirqs. In case of failure, the guest will crash and the vCPU
>> will be unscheduled. In normal case, if the rescheduling is necessary
>> (might be set by a timer or by a caller in check_for_vcpu_work(),
>> where wait_for_io() is a preemption point) the vCPU will be rescheduled
>> to give place to someone else.
>>
> What you describe here is a bug that was introduced by this series. If 
> you think the code requires a separate patch, then please split off 
> patch #14 so the code callling vcpu_ioreq_handle_completion() happen here.
I am afraid, I don't understand which bug you are talking about, I just 
tried to explain why using a loop is not bad (there wouldn't be any 
impact to other vCPUs, etc) and the worse case which could happen.
Also I don't see a reason why the code requires a separate patch 
(probably, if I understood a bug I would see a reason ...) Could you 
please clarify?

>
>
>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>
>
>> CC: Julien Grall <julien.grall@arm.com>
>> [On Arm only]
>> Tested-by: Wei Chen <Wei.Chen@arm.com>
>>
>> ---
>> Please note, this is a split/cleanup/hardening of Julien's PoC:
>> "Add support for Guest IO forwarding to a device emulator"
>>
>> Changes V1 -> V2:
>>     - new patch, changes were derived from (+ new explanation):
>>       arm/ioreq: Introduce arch specific bits for IOREQ/DM features
>>
>> Changes V2 -> V3:
>>     - update patch description
>>
>> Changes V3 -> V4:
>>     - update patch description and comment in code
>> ---
>>   xen/arch/arm/traps.c | 38 +++++++++++++++++++++++++++++++++-----
>>   1 file changed, 33 insertions(+), 5 deletions(-)
>>
>> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
>> index 036b13f..4a83e1e 100644
>> --- a/xen/arch/arm/traps.c
>> +++ b/xen/arch/arm/traps.c
>> @@ -2257,18 +2257,23 @@ static void check_for_pcpu_work(void)
>>    * Process pending work for the vCPU. Any call should be fast or
>>    * implement preemption.
>>    */
>> -static void check_for_vcpu_work(void)
>> +static bool check_for_vcpu_work(void)
>>   {
>>       struct vcpu *v = current;
>>     #ifdef CONFIG_IOREQ_SERVER
>> +    bool handled;
>> +
>>       local_irq_enable();
>> -    vcpu_ioreq_handle_completion(v);
>> +    handled = vcpu_ioreq_handle_completion(v);
>>       local_irq_disable();
>> +
>> +    if ( !handled )
>> +        return true;
>>   #endif
>>         if ( likely(!v->arch.need_flush_to_ram) )
>> -        return;
>> +        return false;
>>         /*
>>        * Give a chance for the pCPU to process work before handling 
>> the vCPU
>> @@ -2279,6 +2284,8 @@ static void check_for_vcpu_work(void)
>>       local_irq_enable();
>>       p2m_flush_vm(v);
>>       local_irq_disable();
>> +
>> +    return false;
>>   }
>>     /*
>> @@ -2291,8 +2298,29 @@ void leave_hypervisor_to_guest(void)
>>   {
>>       local_irq_disable();
>>   -    check_for_vcpu_work();
>> -    check_for_pcpu_work();
>> +    /*
>> +     * The reason to use an unbounded loop here is the fact that vCPU
>> +     * shouldn't continue until the I/O has completed.
>> +     *
>> +     * The worse that can happen here if the vCPU will never run again
>> +     * (the I/O will never complete). But, in Xen case, if the I/O 
>> never
>> +     * completes then it most likely means that something went horribly
>> +     * wrong with the Device Emulator. And it is most likely not safe
>> +     * to continue. So letting the vCPU to spin forever if the I/O 
>> never
>> +     * completes is a safer action than letting it continue and leaving
>> +     * the guest in unclear state and is the best what we can do for 
>> now.
>> +     *
>> +     * Please note, using this loop we will not spin forever on a pCPU,
>> +     * preventing any other vCPUs from being scheduled. At every loop
>> +     * we will call check_for_pcpu_work() that will process pending
>> +     * softirqs. In case of failure, the guest will crash and the vCPU
>> +     * will be unscheduled. In normal case, if the rescheduling is 
>> necessary
>> +     * (might be set by a timer or by a caller in 
>> check_for_vcpu_work(),
>> +     * the vCPU will be rescheduled to give place to someone else.
>
> TBH, I think this comment is a bit too much and sort of out of context 
> because this describing the inner implementation of 
> check_for_vcpu_work().
>
> How about the following:
>
> /*
>  * check_for_vcpu_work() may return true if there are more work to
>  * before the vCPU can safely resume. This gives us an opportunity
>  * to deschedule the vCPU if needed.
>  */

I am fine with that.

>
>> +     */
>> +    do {
>> +        check_for_pcpu_work();
>> +    } while ( check_for_vcpu_work() );
>
> So there are two important changes in this new implementation:
>   1) Without CONFIG_IOREQ_SERVER=y, we will call check_for_pcpu_work() 
> twice in a row when handling set/way.

hmm, yes

>
>   2) After handling the pCPU work, we will now return to the guest 
> directly. Before, we gave another opportunity for Xen to schedule a 
> different work. This means, we may return to the vCPU for a very short 
> time and will introduce more overhead.

yes, I haven't even imagined this could cause such difference in behavior

>
>
> So I would rework the loop to write it as:
>
> while ( check_for_pcpu_work() )
>    check_for_pcpu_work();
> check_for_pcpu_work();

makes sense, I assume you meant while ( check_for_vcpu_work() ) ...

>
>>         vgic_sync_to_lrs();
>>
>
> Cheers,
>
-- 
Regards,

Oleksandr Tyshchenko