All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alexandru Elisei <alexandru.elisei@arm.com>
To: Zenghui Yu <yuzenghui@huawei.com>,
	kvm@vger.kernel.org, kvmarm@lists.cs.columbia.edu,
	drjones@redhat.com
Cc: andre.przywara@arm.com
Subject: Re: [kvm-unit-tests PATCH 10/10] arm64: gic: Use IPI test checking for the LPI tests
Date: Tue, 1 Dec 2020 15:09:44 +0000	[thread overview]
Message-ID: <1dc913fd-80f5-4c8b-9c4d-52a1f0eee400@arm.com> (raw)
In-Reply-To: <eb8f5df2-6ea5-7aad-344a-70fcef3adad8@arm.com>

Hi,

On 11/30/20 2:19 PM, Alexandru Elisei wrote:
> Hi Zenghui,
>
> On 11/30/20 1:59 PM, Zenghui Yu wrote:
>> Hi Alex,
>>
>> On 2020/11/27 22:50, Alexandru Elisei wrote:
>>> Hi Zhenghui,
>>>
>>> Thank you for having a look at this!
>>>
>>> On 11/26/20 9:30 AM, Zenghui Yu wrote:
>>>> On 2020/11/25 23:51, Alexandru Elisei wrote:
>>>>> The reason for the failure is that the test "dev2/eventid=20 now triggers
>>>>> an LPI" triggers 2 LPIs, not one. This behavior was present before this
>>>>> patch, but it was ignored because check_lpi_stats() wasn't looking at the
>>>>> acked array.
>>>>>
>>>>> I'm not familiar with the ITS so I'm not sure if this is expected, if the
>>>>> test is incorrect or if there is something wrong with KVM emulation.
>>>> I think this is expected, or not.
>>>>
>>>> Before INVALL, the LPI-8195 was already pending but disabled. On
>>>> receiving INVALL, VGIC will reload configuration for all LPIs targeting
>>>> collection-3 and deliver the now enabled LPI-8195. We'll therefore see
>>>> and handle it before sending the following INT (which will set the
>>>> LPI-8195 pending again).
>>>>
>>>>> Did some more testing on an Ampere eMAG (fast out-of-order cores) using
>>>>> qemu and kvmtool and Linux v5.8, here's what I found:
>>>>>
>>>>> - Using qemu and gic.flat built from*master*: error encountered 864 times
>>>>>     out of 1088 runs.
>>>>> - Using qemu: error encountered 852 times out of 1027 runs.
>>>>> - Using kvmtool: error encountered 8164 times out of 10602 runs.
>>>> If vcpu-3 hadn't seen and handled LPI-8195 as quickly as possible (e.g.,
>>>> vcpu-3 hadn't been scheduled), the following INT will set the already
>>>> pending LPI-8195 pending again and we'll receive it *once* on vcpu-3.
>>>> And we won't see the mentioned failure.
>>>>
>>>> I think we can just drop the (meaningless and confusing?) INT.
>>> I think I understand your explanation, the VCPU takes the interrupt immediately
>>> after the INVALL and before the INT, and the second interrupt that I am seeing is
>>> the one caused by the INT command.
>> Yes.
>>
>>> I tried modifying the test like this:
>>>
>>> diff --git a/arm/gic.c b/arm/gic.c
>>> index 6e93da80fe0d..0ef8c12ea234 100644
>>> --- a/arm/gic.c
>>> +++ b/arm/gic.c
>>> @@ -761,10 +761,17 @@ static void test_its_trigger(void)
>>>          wmb();
>>>          cpumask_clear(&mask);
>>>          cpumask_set_cpu(3, &mask);
>>> -       its_send_int(dev2, 20);
>> Shouldn't its_send_invall(col3) be moved down here? See below.
>>
>>>          wait_for_interrupts(&mask);
>>>          report(check_acked(&mask, 0, 8195),
>>> -                       "dev2/eventid=20 now triggers an LPI");
>>> +                       "dev2/eventid=20 pending LPI is received");
>>> +
>>> +       stats_reset();
>>> +       wmb();
>>> +       cpumask_clear(&mask);
>>> +       cpumask_set_cpu(3, &mask);
>>> +       its_send_int(dev2, 20);
>>> +       wait_for_interrupts(&mask);
>>> +       report(check_acked(&mask, 0, 8195), "dev2/eventid=20 triggers an LPI");
>>>            report_prefix_pop();
>>>   I removed the INT from the initial test, and added a separate one to check that
>>> the INT command still works. That looks to me that preserves the spirit of the
>>> original test. After doing stress testing this is what I got:
>>>
>>> - with kvmtool, 47,709 iterations, 27 times the test timed out when waiting for
>>> the interrupt after INVALL.
>>> - with qemu, 15,511 iterations, 258 times the test timed out when waiting for the
>>> interrupt after INVALL, just like with kvmtool.
>> I guess the reason of failure is that the LPI is taken *immediately*
>> after the INVALL?
>>
>>     /* Now call the invall and check the LPI hits */
>>     its_send_invall(col3);
>>         <- LPI is taken, acked[]++
>>     stats_reset();
>>         <- acked[] is cleared unexpectedly
>>     wmb();
>>     cpumask_clear(&mask);
>>     cpumask_set_cpu(3, &mask);
>>     wait_for_interrupts(&mask);
>>         <- we'll hit timed-out since acked[] is 0
> Yes, of course, you're right, I didn't realize that I was resetting the stats
> *after* the interrupt was enabled. This also explains why I was still seeing
> timeouts even when the timeout duration was set to 50 seconds. I'll retest with
> the fix:
>
> diff --git a/arm/gic.c b/arm/gic.c
> index 6e93da80fe0d..c4240f5aba39 100644
> --- a/arm/gic.c
> +++ b/arm/gic.c
> @@ -756,15 +756,22 @@ static void test_its_trigger(void)
>                         "dev2/eventid=20 still does not trigger any LPI");
>  
>         /* Now call the invall and check the LPI hits */
> +       stats_reset();
> +       wmb();
> +       cpumask_clear(&mask);
> +       cpumask_set_cpu(3, &mask);
>         its_send_invall(col3);
> +       wait_for_interrupts(&mask);
> +       report(check_acked(&mask, 0, 8195),
> +                       "dev2/eventid=20 pending LPI is received");
> +
>         stats_reset();
>         wmb();
>         cpumask_clear(&mask);
>         cpumask_set_cpu(3, &mask);
>         its_send_int(dev2, 20);
>         wait_for_interrupts(&mask);
> -       report(check_acked(&mask, 0, 8195),
> -                       "dev2/eventid=20 now triggers an LPI");
> +       report(check_acked(&mask, 0, 8195), "dev2/eventid20 triggers an LPI");
>  
>         report_prefix_pop();
>  
> I also pushed a branch at [1].
>
> Thank you so much for spotting this! You've saved me (and probably others) a lot
> of time debugging.
>
> [1] https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/fixes1-v2

I have been testing the branch, no failures after 17,996 runs with qemu and 58,669
runs with kvmtool. This looks fine to me, I'll send a v2 with the fix.

Thanks,
Alex

WARNING: multiple messages have this Message-ID (diff)
From: Alexandru Elisei <alexandru.elisei@arm.com>
To: Zenghui Yu <yuzenghui@huawei.com>,
	kvm@vger.kernel.org, kvmarm@lists.cs.columbia.edu,
	drjones@redhat.com
Cc: andre.przywara@arm.com
Subject: Re: [kvm-unit-tests PATCH 10/10] arm64: gic: Use IPI test checking for the LPI tests
Date: Tue, 1 Dec 2020 15:09:44 +0000	[thread overview]
Message-ID: <1dc913fd-80f5-4c8b-9c4d-52a1f0eee400@arm.com> (raw)
In-Reply-To: <eb8f5df2-6ea5-7aad-344a-70fcef3adad8@arm.com>

Hi,

On 11/30/20 2:19 PM, Alexandru Elisei wrote:
> Hi Zenghui,
>
> On 11/30/20 1:59 PM, Zenghui Yu wrote:
>> Hi Alex,
>>
>> On 2020/11/27 22:50, Alexandru Elisei wrote:
>>> Hi Zhenghui,
>>>
>>> Thank you for having a look at this!
>>>
>>> On 11/26/20 9:30 AM, Zenghui Yu wrote:
>>>> On 2020/11/25 23:51, Alexandru Elisei wrote:
>>>>> The reason for the failure is that the test "dev2/eventid=20 now triggers
>>>>> an LPI" triggers 2 LPIs, not one. This behavior was present before this
>>>>> patch, but it was ignored because check_lpi_stats() wasn't looking at the
>>>>> acked array.
>>>>>
>>>>> I'm not familiar with the ITS so I'm not sure if this is expected, if the
>>>>> test is incorrect or if there is something wrong with KVM emulation.
>>>> I think this is expected, or not.
>>>>
>>>> Before INVALL, the LPI-8195 was already pending but disabled. On
>>>> receiving INVALL, VGIC will reload configuration for all LPIs targeting
>>>> collection-3 and deliver the now enabled LPI-8195. We'll therefore see
>>>> and handle it before sending the following INT (which will set the
>>>> LPI-8195 pending again).
>>>>
>>>>> Did some more testing on an Ampere eMAG (fast out-of-order cores) using
>>>>> qemu and kvmtool and Linux v5.8, here's what I found:
>>>>>
>>>>> - Using qemu and gic.flat built from*master*: error encountered 864 times
>>>>>     out of 1088 runs.
>>>>> - Using qemu: error encountered 852 times out of 1027 runs.
>>>>> - Using kvmtool: error encountered 8164 times out of 10602 runs.
>>>> If vcpu-3 hadn't seen and handled LPI-8195 as quickly as possible (e.g.,
>>>> vcpu-3 hadn't been scheduled), the following INT will set the already
>>>> pending LPI-8195 pending again and we'll receive it *once* on vcpu-3.
>>>> And we won't see the mentioned failure.
>>>>
>>>> I think we can just drop the (meaningless and confusing?) INT.
>>> I think I understand your explanation, the VCPU takes the interrupt immediately
>>> after the INVALL and before the INT, and the second interrupt that I am seeing is
>>> the one caused by the INT command.
>> Yes.
>>
>>> I tried modifying the test like this:
>>>
>>> diff --git a/arm/gic.c b/arm/gic.c
>>> index 6e93da80fe0d..0ef8c12ea234 100644
>>> --- a/arm/gic.c
>>> +++ b/arm/gic.c
>>> @@ -761,10 +761,17 @@ static void test_its_trigger(void)
>>>          wmb();
>>>          cpumask_clear(&mask);
>>>          cpumask_set_cpu(3, &mask);
>>> -       its_send_int(dev2, 20);
>> Shouldn't its_send_invall(col3) be moved down here? See below.
>>
>>>          wait_for_interrupts(&mask);
>>>          report(check_acked(&mask, 0, 8195),
>>> -                       "dev2/eventid=20 now triggers an LPI");
>>> +                       "dev2/eventid=20 pending LPI is received");
>>> +
>>> +       stats_reset();
>>> +       wmb();
>>> +       cpumask_clear(&mask);
>>> +       cpumask_set_cpu(3, &mask);
>>> +       its_send_int(dev2, 20);
>>> +       wait_for_interrupts(&mask);
>>> +       report(check_acked(&mask, 0, 8195), "dev2/eventid=20 triggers an LPI");
>>>            report_prefix_pop();
>>>   I removed the INT from the initial test, and added a separate one to check that
>>> the INT command still works. That looks to me that preserves the spirit of the
>>> original test. After doing stress testing this is what I got:
>>>
>>> - with kvmtool, 47,709 iterations, 27 times the test timed out when waiting for
>>> the interrupt after INVALL.
>>> - with qemu, 15,511 iterations, 258 times the test timed out when waiting for the
>>> interrupt after INVALL, just like with kvmtool.
>> I guess the reason of failure is that the LPI is taken *immediately*
>> after the INVALL?
>>
>>     /* Now call the invall and check the LPI hits */
>>     its_send_invall(col3);
>>         <- LPI is taken, acked[]++
>>     stats_reset();
>>         <- acked[] is cleared unexpectedly
>>     wmb();
>>     cpumask_clear(&mask);
>>     cpumask_set_cpu(3, &mask);
>>     wait_for_interrupts(&mask);
>>         <- we'll hit timed-out since acked[] is 0
> Yes, of course, you're right, I didn't realize that I was resetting the stats
> *after* the interrupt was enabled. This also explains why I was still seeing
> timeouts even when the timeout duration was set to 50 seconds. I'll retest with
> the fix:
>
> diff --git a/arm/gic.c b/arm/gic.c
> index 6e93da80fe0d..c4240f5aba39 100644
> --- a/arm/gic.c
> +++ b/arm/gic.c
> @@ -756,15 +756,22 @@ static void test_its_trigger(void)
>                         "dev2/eventid=20 still does not trigger any LPI");
>  
>         /* Now call the invall and check the LPI hits */
> +       stats_reset();
> +       wmb();
> +       cpumask_clear(&mask);
> +       cpumask_set_cpu(3, &mask);
>         its_send_invall(col3);
> +       wait_for_interrupts(&mask);
> +       report(check_acked(&mask, 0, 8195),
> +                       "dev2/eventid=20 pending LPI is received");
> +
>         stats_reset();
>         wmb();
>         cpumask_clear(&mask);
>         cpumask_set_cpu(3, &mask);
>         its_send_int(dev2, 20);
>         wait_for_interrupts(&mask);
> -       report(check_acked(&mask, 0, 8195),
> -                       "dev2/eventid=20 now triggers an LPI");
> +       report(check_acked(&mask, 0, 8195), "dev2/eventid20 triggers an LPI");
>  
>         report_prefix_pop();
>  
> I also pushed a branch at [1].
>
> Thank you so much for spotting this! You've saved me (and probably others) a lot
> of time debugging.
>
> [1] https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/fixes1-v2

I have been testing the branch, no failures after 17,996 runs with qemu and 58,669
runs with kvmtool. This looks fine to me, I'll send a v2 with the fix.

Thanks,
Alex
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

  reply	other threads:[~2020-12-01 15:09 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-25 15:51 [kvm-unit-tests PATCH 00/10] GIC fixes and improvements Alexandru Elisei
2020-11-25 15:51 ` Alexandru Elisei
2020-11-25 15:51 ` [kvm-unit-tests PATCH 01/10] lib: arm/arm64: gicv3: Add missing barrier when sending IPIs Alexandru Elisei
2020-11-25 15:51   ` Alexandru Elisei
2020-12-01 16:37   ` Auger Eric
2020-12-01 16:37     ` Auger Eric
2020-12-01 17:37     ` Alexandru Elisei
2020-12-01 17:37       ` Alexandru Elisei
2020-11-25 15:51 ` [kvm-unit-tests PATCH 02/10] lib: arm/arm64: gicv2: " Alexandru Elisei
2020-11-25 15:51   ` Alexandru Elisei
2020-12-01 16:37   ` Auger Eric
2020-12-01 16:37     ` Auger Eric
2020-11-25 15:51 ` [kvm-unit-tests PATCH 03/10] arm/arm64: gic: Remove memory synchronization from ipi_clear_active_handler() Alexandru Elisei
2020-11-25 15:51   ` Alexandru Elisei
2020-12-01 16:37   ` Auger Eric
2020-12-01 16:37     ` Auger Eric
2020-12-02 14:02     ` Alexandru Elisei
2020-12-02 14:02       ` Alexandru Elisei
2020-12-02 14:14       ` Alexandru Elisei
2020-12-02 14:14         ` Alexandru Elisei
2020-12-03  9:41         ` Auger Eric
2020-12-03  9:41           ` Auger Eric
2020-11-25 15:51 ` [kvm-unit-tests PATCH 04/10] arm/arm64: gic: Remove unnecessary synchronization with stats_reset() Alexandru Elisei
2020-11-25 15:51   ` Alexandru Elisei
2020-12-01 16:48   ` Auger Eric
2020-12-01 16:48     ` Auger Eric
2020-12-02 14:06     ` Alexandru Elisei
2020-12-02 14:06       ` Alexandru Elisei
2020-12-03 13:10   ` Auger Eric
2020-12-03 13:10     ` Auger Eric
2020-11-25 15:51 ` [kvm-unit-tests PATCH 05/10] arm/arm64: gic: Use correct memory ordering for the IPI test Alexandru Elisei
2020-11-25 15:51   ` Alexandru Elisei
2020-12-03 13:10   ` Auger Eric
2020-12-03 13:10     ` Auger Eric
2020-12-03 13:21     ` Alexandru Elisei
2020-12-03 13:21       ` Alexandru Elisei
2020-11-25 15:51 ` [kvm-unit-tests PATCH 06/10] arm/arm64: gic: Check spurious and bad_sender in the active test Alexandru Elisei
2020-11-25 15:51   ` Alexandru Elisei
2020-12-03 13:10   ` Auger Eric
2020-12-03 13:10     ` Auger Eric
2020-11-25 15:51 ` [kvm-unit-tests PATCH 07/10] arm/arm64: gic: Wait for writes to acked or spurious to complete Alexandru Elisei
2020-11-25 15:51   ` Alexandru Elisei
2020-12-03 13:21   ` Auger Eric
2020-12-03 13:21     ` Auger Eric
2020-11-25 15:51 ` [kvm-unit-tests PATCH 08/10] arm/arm64: gic: Split check_acked() into two functions Alexandru Elisei
2020-11-25 15:51   ` Alexandru Elisei
2020-12-03 13:39   ` Auger Eric
2020-12-03 13:39     ` Auger Eric
2020-12-10 14:45     ` Alexandru Elisei
2020-12-10 14:45       ` Alexandru Elisei
2020-12-15 13:58       ` Auger Eric
2020-12-15 13:58         ` Auger Eric
2020-12-16 11:40         ` Alexandru Elisei
2020-12-16 11:40           ` Alexandru Elisei
2020-12-16 12:37           ` Auger Eric
2020-12-16 12:37             ` Auger Eric
2020-11-25 15:51 ` [kvm-unit-tests PATCH 09/10] arm/arm64: gic: Make check_acked() more generic Alexandru Elisei
2020-11-25 15:51   ` Alexandru Elisei
2020-12-03 14:59   ` Auger Eric
2020-12-03 14:59     ` Auger Eric
2020-11-25 15:51 ` [kvm-unit-tests PATCH 10/10] arm64: gic: Use IPI test checking for the LPI tests Alexandru Elisei
2020-11-25 15:51   ` Alexandru Elisei
2020-11-26  9:30   ` Zenghui Yu
2020-11-26  9:30     ` Zenghui Yu
2020-11-27 14:50     ` Alexandru Elisei
2020-11-27 14:50       ` Alexandru Elisei
2020-11-30 13:59       ` Zenghui Yu
2020-11-30 13:59         ` Zenghui Yu
2020-11-30 14:19         ` Alexandru Elisei
2020-11-30 14:19           ` Alexandru Elisei
2020-12-01 15:09           ` Alexandru Elisei [this message]
2020-12-01 15:09             ` Alexandru Elisei
2020-11-30 17:48     ` Auger Eric
2020-11-30 17:48       ` Auger Eric
2020-12-03 14:59   ` Auger Eric
2020-12-03 14:59     ` Auger Eric
2020-12-09 10:29     ` Alexandru Elisei
2020-12-09 10:29       ` Alexandru Elisei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1dc913fd-80f5-4c8b-9c4d-52a1f0eee400@arm.com \
    --to=alexandru.elisei@arm.com \
    --cc=andre.przywara@arm.com \
    --cc=drjones@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.cs.columbia.edu \
    --cc=yuzenghui@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.