Hi Sean,

On 8/17/22 8:23 AM, Sean Christopherson wrote:
> On Tue, Aug 16, 2022, Gavin Shan wrote:
>> On 8/16/22 3:02 PM, Gavin Shan wrote:
>>> On 8/16/22 7:42 AM, Sean Christopherson wrote:
>>>> On Mon, Aug 15, 2022, kernel test robot wrote:
>>>>> commit: e923b0537d28e15c9d31ce8b38f810b325816903 ("KVM: selftests: Fix target thread to be migrated in rseq_test")
>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>
>>>> ...
>>>>
>>>>> # selftests: kvm: rseq_test
>>>>> # ==== Test Assertion Failure ====
>>>>> #   rseq_test.c:278: i > (NR_TASK_MIGRATIONS / 2)
>>>>> #   pid=49599 tid=49599 errno=4 - Interrupted system call
>>>>> #      1    0x000000000040265d: main at rseq_test.c:278
>>>>> #      2    0x00007fe44eed07fc: ?? ??:0
>>>>> #      3    0x00000000004026d9: _start at ??:?
>>>>> #   Only performed 23174 KVM_RUNs, task stalled too much?
>>>>> #
>>>>> not ok 56 selftests: kvm: rseq_test # exit=254
>>>>
>>>> ...
>>>>
>>>>> # Automatically generated file; DO NOT EDIT.
>>>>> # Linux/x86_64 5.19.0-rc6 Kernel Configuration
>>>>> #
>>>>> CONFIG_CC_VERSION_TEXT="gcc-11 (Debian 11.3.0-3) 11.3.0"
>>>>> CONFIG_CC_IS_GCC=y
>>>>> CONFIG_GCC_VERSION=110300
>>>>> CONFIG_CLANG_VERSION=0
>>>>> CONFIG_AS_IS_GNU=y
>>>>> CONFIG_AS_VERSION=23800
>>>>> CONFIG_LD_IS_BFD=y
>>>>> CONFIG_LD_VERSION=23800
>>>>> CONFIG_LLD_VERSION=0
>>>>
>>>> Assuming 23800 == 2.38, this is a known issue.
>>>>
>>>> https://lore.kernel.org/all/20220810104114.6838-1-gshan@redhat.com
>>>>
>>>
>>> It's probably a different story this time.
>
> Doh, if I had bothered to actually look at the error message...
>

Ok :)

>>> The assert is triggered because of the following instructions. I would
>>> guess the reason is that the vcpu thread has been running on a CPU with
>>> a high load. In this case, the vcpu thread can't be scheduled in time.
>>> More specifically, the vcpu thread can't run within the 1 - 10us time
>>> window specified by the migration worker (thread).
>>>
>>>     TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2),
>>>                 "Only performed %d KVM_RUNs, task stalled too much?\n", i);
>>>
>>> I think we need to improve the handshake mechanism between the vcpu thread
>>> and the migration worker. In the current implementation, the handshake is
>>> done through an atomic counter. The mechanism is simple enough, but the
>>> vcpu thread can miss the aforementioned time window. Another issue is that
>>> the test case takes much more time than expected to finish.
>
> There's not really an expected time to finish. The original purpose of the test
> is to trigger a kernel race condition, so it's a balance between letting the test
> run long enough to have some confidence that the kernel is bug free, and not running
> so long that it wastes time.
>

Yeah, I was thinking of that. It's why I'm not 100% sure about my proposal
to have full synchronization.

>>> Sean, if you think it's reasonable, I can figure out something to do:
>>>
>>> - Reuse the atomic counter for a full synchronization between these two
>>>   threads. Something like below:
>>>
>>>   #define RSEQ_TEST_STATE_RUN_VCPU       0     // vcpu_run()
>>>   #define RSEQ_TEST_STATE_MIGRATE        1     // sched_setaffinity()
>>>   #define RSEQ_TEST_STATE_CHECK          2     // Check rseq.cpu_id and get_cpu()
>>>
>>>   The atomic counter is reset to RSEQ_TEST_STATE_RUN_VCPU after RSEQ_TEST_STATE_CHECK
>
> Again, because one of the primary goals is to ensure the kernel is race free, the
> test should avoid full synchronization.
>

Ok.
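For completeness, the loose handshake we're keeping is roughly the sketch
below. It's only an illustration of the seqcount-style coupling (the worker
bumps an atomic counter around sched_setaffinity() and sleeps for 1-10us,
while the vcpu-like loop snapshots the counter and redoes its check when a
migration raced with it). The names, the fixed CPU#0/CPU#1 bouncing and the
sched_getcpu() stand-in for KVM_RUN plus the rseq.cpu_id check are mine for
illustration only, not copied from rseq_test.c:

/* Illustrative sketch only; not the actual selftest code. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NR_MIGRATIONS 1000

static atomic_int seq_cnt;
static pid_t vcpu_tid;

/* Migration worker: never waits for the other thread, only bumps seq_cnt. */
static void *migration_worker(void *arg)
{
	cpu_set_t mask;
	int i;

	(void)arg;
	for (i = 0; i < NR_MIGRATIONS; i++) {
		CPU_ZERO(&mask);
		CPU_SET(i & 1, &mask);		/* bounce between CPU#0 and CPU#1 */

		atomic_fetch_add(&seq_cnt, 1);	/* odd => migration in flight */
		sched_setaffinity(vcpu_tid, sizeof(mask), &mask);
		atomic_fetch_add(&seq_cnt, 1);	/* even => migration done */

		usleep((i % 10) + 1);		/* the 1-10us window */
	}
	return NULL;
}

int main(void)
{
	pthread_t worker;
	int i, snapshot;

	vcpu_tid = syscall(SYS_gettid);
	pthread_create(&worker, NULL, migration_worker, NULL);

	/*
	 * "vcpu" loop: do the work (KVM_RUN in the real test), then redo the
	 * consistency check if seq_cnt moved underneath us.  Nothing here
	 * ever blocks the worker, which is the point of the loose coupling.
	 */
	for (i = 0; atomic_load(&seq_cnt) < 2 * NR_MIGRATIONS; i++) {
		do {
			snapshot = atomic_load(&seq_cnt);
			/* stand-in for KVM_RUN + rseq.cpu_id vs. sched_getcpu() */
			(void)sched_getcpu();
		} while ((snapshot & 1) || snapshot != atomic_load(&seq_cnt));
	}

	pthread_join(worker, NULL);
	/* The real test asserts i > NR_TASK_MIGRATIONS / 2 at this point. */
	printf("performed %d iterations\n", i);
	return 0;
}

Neither loop ever waits for the other, so the kernel race the test targets
can still be hit; the flip side is exactly what the bot saw - under heavy
load the vcpu loop may complete too few iterations and trip the assert.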
>>>
>>> - Reduce NR_TASK_MIGRATIONS from 100000 to num_of_online_cpus(). With this,
>>>   less time is needed to finish the test case.
>>>
>>
>> I'm able to recreate the issue on my local arm64 system:
>>
>> - From the source code, the iteration count is changed from 100000 to 1000
>> - Only CPU#0 and CPU#1 are exposed in calc_min_max_cpu(), meaning other CPUs
>>   are cleared from @possible_mask
>> - Run some CPU-bound task on CPU#0 and CPU#1
>>   # while true; do taskset -c 0 ./a; done
>>   # while true; do taskset -c 1 ./a; done
>> - Run 'rseq_test' and hit the issue
>
> At this point, this isn't a test bug. The test is right to complain that it didn't
> provide the coverage it's supposed to provide.
>
> If the bot failure is a one-off, my preference is to leave things as-is for now.
> If the failure is an ongoing issue, then we probably need to understand why the
> bot is failing.
>

Yeah, the system running the coverage tests was likely under a high CPU load,
which is similar to my (simulated) environment. I usually keep my system idle
when running the coverage test cases, and I hadn't hit this specific failure
before. Let's leave it as is; we can improve it if needed in the future :)

Thanks,
Gavin