regressions.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Gavin Shan <gshan@redhat.com>
Cc: kernel test robot <oliver.sang@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Oliver Upton <oliver.upton@linux.dev>,
	Andrew Jones <andrew.jones@linux.dev>,
	LKML <linux-kernel@vger.kernel.org>,
	kvm@vger.kernel.org, lkp@lists.01.org, lkp@intel.com,
	xudong.hao@intel.com, regressions@lists.linux.dev
Subject: Re: [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail
Date: Tue, 16 Aug 2022 22:23:01 +0000	[thread overview]
Message-ID: <YvwYxeE4vc/Srbil@google.com> (raw)
In-Reply-To: <9bfeae26-b4b1-eedb-6cbd-b4f9f1e1cc55@redhat.com>

On Tue, Aug 16, 2022, Gavin Shan wrote:
> Hi Sean,
> 
> On 8/16/22 3:02 PM, Gavin Shan wrote:
> > On 8/16/22 7:42 AM, Sean Christopherson wrote:
> > > On Mon, Aug 15, 2022, kernel test robot wrote:
> > > > commit: e923b0537d28e15c9d31ce8b38f810b325816903 ("KVM: selftests: Fix target thread to be migrated in rseq_test")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > 
> > > ...
> > > 
> > > > # selftests: kvm: rseq_test
> > > > # ==== Test Assertion Failure ====
> > > > #   rseq_test.c:278: i > (NR_TASK_MIGRATIONS / 2)
> > > > #   pid=49599 tid=49599 errno=4 - Interrupted system call
> > > > #      1    0x000000000040265d: main at rseq_test.c:278
> > > > #      2    0x00007fe44eed07fc: ?? ??:0
> > > > #      3    0x00000000004026d9: _start at ??:?
> > > > #   Only performed 23174 KVM_RUNs, task stalled too much?
> > > > #
> > > > not ok 56 selftests: kvm: rseq_test # exit=254
> > > 
> > > ...
> > > 
> > > > # Automatically generated file; DO NOT EDIT.
> > > > # Linux/x86_64 5.19.0-rc6 Kernel Configuration
> > > > #
> > > > CONFIG_CC_VERSION_TEXT="gcc-11 (Debian 11.3.0-3) 11.3.0"
> > > > CONFIG_CC_IS_GCC=y
> > > > CONFIG_GCC_VERSION=110300
> > > > CONFIG_CLANG_VERSION=0
> > > > CONFIG_AS_IS_GNU=y
> > > > CONFIG_AS_VERSION=23800
> > > > CONFIG_LD_IS_BFD=y
> > > > CONFIG_LD_VERSION=23800
> > > > CONFIG_LLD_VERSION=0
> > > 
> > > Assuming 23800 == 2.38, this is a known issue.
> > > 
> > > https://lore.kernel.org/all/20220810104114.6838-1-gshan@redhat.com
> > > 
> > 
> > It's probably different story this time.

Doh, if I had bothered to actually look at the error message...

> > The assert is triggered because of the following instructions. I would
> > guess the reason is vcpu thread has been running on CPU where we has high
> > CPU load. In this case, the vcpu thread can't be run in time. More
> > specific, the vcpu thread can't be run in the 1 - 10us time window, which
> > is specified by the migration worker (thread).
> > 
> >      TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2),
> >                  "Only performed %d KVM_RUNs, task stalled too much?\n", i);
> > 
> > I think we need to improve the handshake mechanism between the vcpu thread
> > and migration worker. In current implementation, the handshake is done through
> > the atomic counter. The mechanism is simple enough, but vcpu thread can miss
> > the aforementioned time window. Another issue is the test case much more time
> > than expected to finish.

There's not really an expected time to finish.  The original purpose of the test
is to trigger a kernel race condition, so it's a balance between letting the test
run long enough to have some confidence that the kernel is bug free, and not running
so long that it wastes time.

> > Sean, if you think it's reasonable, I can figure out something to do:
> > 
> > - Reuse the atomic counter for a full synchronization between these two
> >    threads. Something like below:
> > 
> >    #define RSEQ_TEST_STATE_RUN_VCPU       0     // vcpu_run()
> >    #define RSEQ_TEST_STATE_MIGRATE        1     // sched_setaffinity()
> >    #define RSEQ_TEST_STATE_CHECK          2     // Check rseq.cpu_id and get_cpu()
> > 
> >    The atomic counter is reset to RSEQ_TEST_STATE_RUN_VCPU after RSEQ_TEST_STATE_RUN_VCPU

Again, because one of the primary goals is to ensure the kernel is race free, the
test should avoid full synchronization.

> > 
> > - Reduce NR_TASK_MIGRATIONS from 100000 to num_of_online_cpus(). With this,
> >    less time is needed to finish the test case.
> > 
> 
> I'm able to recreate the issue on my local arm64 system.
> 
> - From the source code, the iteration count is changed from 100000 to 1000
> - Only CPU#0 and CPU#1 are exposed in calc_min_max_cpu, meaning other CPUs
>   are cleared from @possible_mask
> - Run some CPU bound task on CPU#0 and CPU#1
>   # while true; do taskset -c 0 ./a; done
>   # while true; do taskset -c 1 ./a; done
> - Run 'rseq_test' and hit the issue

At this point, this isn't a test bug.  The test is right to complain that it didn't
provide the coverage it's supposed to provide.

If the bot failure is a one-off, my preference is to leave things as-is for now.
If the failure is an ongoing issue, then we probably need to understand why the
bot is failing.

  reply	other threads:[~2022-08-16 22:23 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-15  7:50 [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail kernel test robot
2022-08-15 21:42 ` Sean Christopherson
2022-08-16  5:02   ` Gavin Shan
2022-08-16 10:02     ` Gavin Shan
2022-08-16 22:23       ` Sean Christopherson [this message]
2022-08-17  0:06         ` Gavin Shan
2022-09-05 10:47           ` [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail #forregzbot Thorsten Leemhuis
2022-11-01  8:32       ` [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail Robert Hoo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YvwYxeE4vc/Srbil@google.com \
    --to=seanjc@google.com \
    --cc=andrew.jones@linux.dev \
    --cc=gshan@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lkp@intel.com \
    --cc=lkp@lists.01.org \
    --cc=oliver.sang@intel.com \
    --cc=oliver.upton@linux.dev \
    --cc=pbonzini@redhat.com \
    --cc=regressions@lists.linux.dev \
    --cc=xudong.hao@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).