Date: Fri, 27 Aug 2021 15:09:34 -0400 (EDT)
From: Mathieu Desnoyers
To: Sean Christopherson
Cc: dvhart, "Russell King, ARM Linux", Catalin Marinas, Will Deacon, Guo Ren,
 Thomas Bogendoerfer, Michael Ellerman, Heiko Carstens, gor,
 Christian Borntraeger, rostedt, Ingo Molnar, Oleg Nesterov, Thomas Gleixner,
 Peter Zijlstra, Andy Lutomirski, paulmck, Boqun Feng, Paolo Bonzini, shuah,
 Benjamin Herrenschmidt, Paul Mackerras, linux-arm-kernel, linux-kernel,
 linux-csky, linux-mips, linuxppc-dev, linux-s390, KVM list, linux-kselftest,
 Peter Foley, Shakeel Butt, Ben Gardon
Message-ID: <339641531.29941.1630091374065.JavaMail.zimbra@efficios.com>
References: <20210820225002.310652-1-seanjc@google.com>
 <20210820225002.310652-5-seanjc@google.com>
 <766990430.21713.1629731934069.JavaMail.zimbra@efficios.com>
 <282257549.21721.1629732017655.JavaMail.zimbra@efficios.com>
 <1700758714.29394.1630003332081.JavaMail.zimbra@efficios.com>
Subject: Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to
 detect task migration bugs

----- On Aug 26, 2021, at 7:54 PM, Sean Christopherson seanjc@google.com wrote:

> On Thu, Aug 26, 2021, Mathieu Desnoyers wrote:
>> ----- On Aug 25, 2021, at 8:51 PM, Sean Christopherson seanjc@google.com wrote:
>>
>> >> >> +	r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> >> >> +	TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> >> >> +		    errno, strerror(errno));
>> >> >> +	atomic_inc(&seq_cnt);
>> >> >> +
>> >> >> +	CPU_CLR(cpu, &allowed_mask);
>> >> >> +
>> >> >> +	/*
>> >> >> +	 * Let the read-side get back into KVM_RUN to improve the odds
>> >> >> +	 * of task migration coinciding with KVM's run loop.
>> >> >
>> >> > This comment should be about increasing the odds of letting the seqlock
>> >> > read-side complete. Otherwise, the delay between the two back-to-back
>> >> > atomic_inc is so small that the seqlock read-side may never have time to
>> >> > complete reading the rseq cpu id and the sched_getcpu() call, and can
>> >> > retry forever.
>> >
>> > Hmm, but that's not why there's a delay. I'm not arguing that a livelock isn't
>> > possible (though that syscall would have to be screaming fast),
>>
>> I don't think we have the same understanding of the livelock scenario. AFAIU
>> the livelock would be caused by a too-small delay between the two consecutive
>> atomic_inc() from one loop iteration to the next, compared to the time it
>> takes to perform a read-side critical section of the seqlock. Back-to-back
>> atomic_inc can be performed very quickly, so I doubt that the sched_getcpu
>> implementation has good odds of being fast enough to complete in that narrow
>> window, leading to lots of read seqlock retries.
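For context, the seqlock read-side under discussion looks roughly like this,
a sketch reconstructed from the quoted patch and the diagram further down
(seq_cnt, smp_rmb(), READ_ONCE() and TEST_ASSERT() are the selftest's names;
the exact __rseq_abi TLS access and the surrounding declarations are
assumptions):

    int snapshot, cpu, rseq_cpu;

    /* Read-side: runs in the vCPU thread after each ioctl(KVM_RUN) returns. */
    do {
        /*
         * Snapshot an even seq_cnt; an odd value means the migration
         * thread is between its two atomic_inc()s, i.e. mid-migration.
         */
        snapshot = atomic_load(&seq_cnt) & ~1;
        smp_rmb();
        rseq_cpu = READ_ONCE(__rseq_abi.cpu_id); /* rseq's view of the CPU   */
        cpu = sched_getcpu();                    /* kernel's view of the CPU */
        smp_rmb();
        /* Retry if seq_cnt changed, i.e. a migration raced with the reads. */
    } while (snapshot != atomic_load(&seq_cnt));

    TEST_ASSERT(rseq_cpu == cpu,
                "rseq CPU = %d, sched CPU = %d", rseq_cpu, cpu);

The entire loop body must execute within a single even-numbered seq_cnt
window; if the migration thread makes that window too short, the reader
retries every time, which is the livelock described above.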
> Ooooh, yeah, brain fart on my side. I was thinking of the two atomic_inc() in
> the same loop iteration and completely ignoring the next iteration. Yes, I
> 100% agree that a delay to ensure forward progress is needed. An assertion in
> main() that the reader completed at least some reasonable number of KVM_RUNs
> is also probably a good idea, e.g. to rule out a false pass due to the reader
> never making forward progress.

Agreed.

> FWIW, the do-while loop does make forward progress without a delay, but at
> ~50% throughput, give or take.

I did not expect absolutely no progress, but a significant slowdown of the
read-side.

>> > but the primary motivation is very much to allow the read-side enough time
>> > to get back into KVM proper.
>>
>> I'm puzzled by your statement. AFAIU, let's say we don't have the delay,
>> then we have:
>>
>> migration thread                 KVM_RUN/read-side thread
>> ---------------------------------------------------------------------------
>>                                  - ioctl(KVM_RUN)
>> - atomic_inc_seq_cst(&seqcnt)
>> - sched_setaffinity
>> - atomic_inc_seq_cst(&seqcnt)
>>                                  - a = atomic_load(&seqcnt) & ~1
>>                                  - smp_rmb()
>>                                  - b = LOAD_ONCE(__rseq_abi->cpu_id);
>>                                  - sched_getcpu()
>>                                  - smp_rmb()
>>                                  - re-load seqcnt/compare (succeeds)
>>                                    - Can only succeed if the entire
>>                                      read-side happens while the seqcnt
>>                                      is in an even numbered state.
>>                                  - if (a != b) abort()
>> /* no delay. Even counter state is very short. */
>> - atomic_inc_seq_cst(&seqcnt)
>> /* Let's suppose the lack of delay causes the setaffinity to complete too
>>    early compared with the KVM_RUN ioctl */
>> - sched_setaffinity
>> - atomic_inc_seq_cst(&seqcnt)
>> /* no delay. Even counter state is very short. */
>> - atomic_inc_seq_cst(&seqcnt)
>> /* Then a setaffinity from a following migration thread loop will run
>>    concurrently with KVM_RUN */
>>                                  - ioctl(KVM_RUN)
>> - sched_setaffinity
>> - atomic_inc_seq_cst(&seqcnt)
>>
>> As pointed out here, if the first setaffinity runs too early compared with
>> KVM_RUN, a following setaffinity will run concurrently with it. However, the
>> fact that the even counter state is very short may very well hurt progress
>> of the read seqlock.
>
> *sigh*
>
> Several hours later, I think I finally have my head wrapped around everything.
>
> Due to the way the test is written and because of how KVM's run loop works,
> TIF_NOTIFY_RESUME or TIF_NEED_RESCHED effectively has to be set before KVM
> actually enters the guest, otherwise KVM will exit to userspace without
> touching the flag, i.e. it will be handled by the normal
> exit_to_user_mode_loop().
>
> Where I got lost was trying to figure out why I couldn't make the bug
> reproduce by causing the guest to exit to KVM, but not userspace, in which
> case KVM should easily trigger the problematic flow as the window for
> sched_getcpu() to collide with KVM would be enormous. The reason I didn't go
> down this route for the "official" test is that, unless there's something
> clever I'm overlooking, it requires arch specific guest code, and also I
> don't know that forcing an exit would even be necessary/sufficient on other
> architectures.
>
> Anyways, I was trying to confirm that the bug was being hit without a delay,
> while still retaining the sequence retry in the test. The test doesn't fail
> because the back-to-back atomic_inc() changes seq_cnt too fast. The read-side
> makes forward progress, but it never observes failure because the do-while
> loop only ever completes after another sched_setaffinity(), never after the
> one that collides with KVM, because it takes too long to get out of
> ioctl(KVM_RUN) and back to the test. I.e. the atomic_inc() in the next loop
> iteration (which makes seq_cnt odd) always completes before the check, and so
> the check ends up spinning until another migration, which correctly updates
> rseq. This was expected and didn't confuse me.
>
> What confused me is that I was trying to confirm the bug was being hit from
> within the kernel by confirming KVM observed TIF_NOTIFY_RESUME, but I
> misunderstood when TIF_NOTIFY_RESUME would get set. KVM can observe
> TIF_NOTIFY_RESUME directly, but it's rare, and I suspect happens iff
> sched_setaffinity() hits the small window where it collides with KVM_RUN
> before KVM enters the guest.
>
> More commonly, the bug occurs when KVM sees TIF_NEED_RESCHED. In that case,
> KVM calls xfer_to_guest_mode_work(), which does schedule() and _that_ sets
> TIF_NOTIFY_RESUME. xfer_to_guest_mode_work() then mishandles
> TIF_NOTIFY_RESUME and the bug is hit, but my confirmation logic in KVM never
> fired.
>
> So there are effectively three reasons we want a delay:
>
>   1. To allow sched_setaffinity() to coincide with ioctl(KVM_RUN) before KVM
>      can enter the guest, so that the guest doesn't need an arch-specific
>      VM-Exit source.
>
>   2. To let ioctl(KVM_RUN) make its way back to the test before the next
>      round of migration.
>
>   3. To ensure the read-side can make forward progress, e.g. if
>      sched_getcpu() involves a syscall.
>
> After looking at KVM for arm64 and s390, #1 is a bit tenuous because x86 is
> the only arch that currently uses xfer_to_guest_mode_work(), i.e. the test
> could be tweaked to be overtly x86-specific. But since a delay is needed for
> #2 and #3, I'd prefer to rely on it for #1 as well, in the hopes that this
> test provides coverage for arm64 and/or s390 if they're ever converted to use
> the common xfer_to_guest_mode_work().
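Putting those three reasons together, the migration thread ends up looking
roughly like this, a sketch that follows the ordering in the diagram above
(NR_TASK_MIGRATIONS, the two-CPU bounce, and the 10us value are illustrative
assumptions, not the final patch):

    static void *migration_worker(void *ign)
    {
        cpu_set_t allowed_mask;
        int r, i, cpu;

        for (i = 0; i < NR_TASK_MIGRATIONS; i++) {
            cpu = i % 2;              /* bounce between two CPUs */
            CPU_ZERO(&allowed_mask);
            CPU_SET(cpu, &allowed_mask);

            atomic_inc(&seq_cnt);     /* odd: migration in flight */
            r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
            TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
                        errno, strerror(errno));
            atomic_inc(&seq_cnt);     /* even: CPU ids consistent again */

            /*
             * Delay for all three reasons above: let setaffinity collide
             * with ioctl(KVM_RUN) before the guest is entered (#1), let
             * KVM_RUN get back out to the test (#2), and keep the even
             * counter window wide enough for the read-side to complete (#3).
             */
            usleep(10);
        }
        return NULL;
    }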
Now that we have this understanding of why we need the delay, it would be good
to write it down in a comment within the test.

Does it reproduce if we randomize the delay, picking it randomly from 0us to
100us (with a 1us step)? That would remove a lot of the need for an
arch-specific magic delay value.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
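As an illustration of the randomized delay suggested above, the fixed sleep in
the migration loop could become something like this (a sketch, not from the
thread; the helper name and the seeding are assumptions, and the 0us..100us
range with a 1us step follows the question above):

    #include <stdlib.h>
    #include <unistd.h>

    /*
     * Replace a fixed usleep() in the migration loop with a delay picked
     * uniformly from 0us to 100us in 1us steps, to avoid depending on a
     * single arch-specific magic value.  Call srand() once at startup.
     */
    static void migration_delay(void)
    {
        usleep(rand() % 101);   /* 0us..100us inclusive */
    }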