From mboxrd@z Thu Jan 1 00:00:00 1970 Content-Type: multipart/mixed; boundary="===============2327833967824695260==" MIME-Version: 1.0 From: Gavin Shan To: lkp@lists.01.org Subject: Re: [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail Date: Wed, 17 Aug 2022 10:06:41 +1000 Message-ID: In-Reply-To: List-Id: --===============2327833967824695260== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Hi Sean, On 8/17/22 8:23 AM, Sean Christopherson wrote: > On Tue, Aug 16, 2022, Gavin Shan wrote: >> On 8/16/22 3:02 PM, Gavin Shan wrote: >>> On 8/16/22 7:42 AM, Sean Christopherson wrote: >>>> On Mon, Aug 15, 2022, kernel test robot wrote: >>>>> commit: e923b0537d28e15c9d31ce8b38f810b325816903 ("KVM: selftests: Fi= x target thread to be migrated in rseq_test") >>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master >>>> >>>> ... >>>> >>>>> # selftests: kvm: rseq_test >>>>> # =3D=3D=3D=3D Test Assertion Failure =3D=3D=3D=3D >>>>> #=C2=A0=C2=A0 rseq_test.c:278: i > (NR_TASK_MIGRATIONS / 2) >>>>> #=C2=A0=C2=A0 pid=3D49599 tid=3D49599 errno=3D4 - Interrupted system = call >>>>> #=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2=A0=C2=A0=C2=A0 0x000000000040265= d: main at rseq_test.c:278 >>>>> #=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 2=C2=A0=C2=A0=C2=A0 0x00007fe44eed07f= c: ?? ??:0 >>>>> #=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 3=C2=A0=C2=A0=C2=A0 0x00000000004026d= 9: _start at ??:? >>>>> #=C2=A0=C2=A0 Only performed 23174 KVM_RUNs, task stalled too much? >>>>> # >>>>> not ok 56 selftests: kvm: rseq_test # exit=3D254 >>>> >>>> ... >>>> >>>>> # Automatically generated file; DO NOT EDIT. >>>>> # Linux/x86_64 5.19.0-rc6 Kernel Configuration >>>>> # >>>>> CONFIG_CC_VERSION_TEXT=3D"gcc-11 (Debian 11.3.0-3) 11.3.0" >>>>> CONFIG_CC_IS_GCC=3Dy >>>>> CONFIG_GCC_VERSION=3D110300 >>>>> CONFIG_CLANG_VERSION=3D0 >>>>> CONFIG_AS_IS_GNU=3Dy >>>>> CONFIG_AS_VERSION=3D23800 >>>>> CONFIG_LD_IS_BFD=3Dy >>>>> CONFIG_LD_VERSION=3D23800 >>>>> CONFIG_LLD_VERSION=3D0 >>>> >>>> Assuming 23800 =3D=3D 2.38, this is a known issue. >>>> >>>> https://lore.kernel.org/all/20220810104114.6838-1-gshan(a)redhat.com >>>> >>> >>> It's probably different story this time. > = > Doh, if I had bothered to actually look at the error message... > = Ok :) >>> The assert is triggered because of the following instructions. I would >>> guess the reason is vcpu thread has been running on CPU where we has hi= gh >>> CPU load. In this case, the vcpu thread can't be run in time. More >>> specific, the vcpu thread can't be run in the 1 - 10us time window, whi= ch >>> is specified by the migration worker (thread). >>> >>> =C2=A0=C2=A0=C2=A0 TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2), >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 "Only performed %d KVM_RUNs, task stalled too much?\n= ", i); >>> >>> I think we need to improve the handshake mechanism between the vcpu thr= ead >>> and migration worker. In current implementation, the handshake is done = through >>> the atomic counter. The mechanism is simple enough, but vcpu thread can= miss >>> the aforementioned time window. Another issue is the test case much mor= e time >>> than expected to finish. > = > There's not really an expected time to finish. The original purpose of t= he test > is to trigger a kernel race condition, so it's a balance between letting = the test > run long enough to have some confidence that the kernel is bug free, and = not running > so long that it wastes time. > = Yeah, I was thinking of it. It's why I'm not 100% sure for my proposal, to = have full synchronization. >>> Sean, if you think it's reasonable, I can figure out something to do: >>> >>> - Reuse the atomic counter for a full synchronization between these two >>> =C2=A0 threads. Something like below: >>> >>> =C2=A0 #define RSEQ_TEST_STATE_RUN_VCPU=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0 // vcpu_run() >>> =C2=A0 #define RSEQ_TEST_STATE_MIGRATE=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 1=C2=A0=C2=A0=C2=A0=C2=A0 // sched_setaffinity() >>> =C2=A0 #define RSEQ_TEST_STATE_CHECK=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 2=C2=A0=C2=A0=C2=A0=C2=A0 // Check rseq.cpu_id and ge= t_cpu() >>> >>> =C2=A0 The atomic counter is reset to RSEQ_TEST_STATE_RUN_VCPU after = RSEQ_TEST_STATE_RUN_VCPU > = > Again, because one of the primary goals is to ensure the kernel is race f= ree, the > test should avoid full synchronization. > = Ok. >>> >>> - Reduce NR_TASK_MIGRATIONS from 100000 to num_of_online_cpus(). With t= his, >>> =C2=A0 less time is needed to finish the test case. >>> >> >> I'm able to recreate the issue on my local arm64 system. >> >> - From the source code, the iteration count is changed from 100000 to 10= 00 >> - Only CPU#0 and CPU#1 are exposed in calc_min_max_cpu, meaning other CP= Us >> are cleared from @possible_mask >> - Run some CPU bound task on CPU#0 and CPU#1 >> # while true; do taskset -c 0 ./a; done >> # while true; do taskset -c 1 ./a; done >> - Run 'rseq_test' and hit the issue > = > At this point, this isn't a test bug. The test is right to complain that= it didn't > provide the coverage it's supposed to provide. > = > If the bot failure is a one-off, my preference is to leave things as-is f= or now. > If the failure is an ongoing issue, then we probably need to understand w= hy the > bot is failing. > = Yeah, the system for the coverage was likely having high CPU loads, which i= s similar to my (simulated) environment. I usually have my system being idle when run= ning the coverage test cases. I didn't hit this specific failure before. Lets leave it as of being. We can improve if needed in future :) Thanks, Gavin --===============2327833967824695260==--