From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Sean Christopherson <seanjc@google.com>
Cc: dvhart <dvhart@infradead.org>,
	"Russell King, ARM Linux" <linux@armlinux.org.uk>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>, Guo Ren <guoren@kernel.org>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Heiko Carstens <hca@linux.ibm.com>, gor <gor@linux.ibm.com>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	rostedt <rostedt@goodmis.org>, Ingo Molnar <mingo@redhat.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Andy Lutomirski <luto@kernel.org>, paulmck <paulmck@kernel.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	Paolo Bonzini <pbonzini@redhat.com>, shuah <shuah@kernel.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@samba.org>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	linux-csky <linux-csky@vger.kernel.org>,
	linux-mips <linux-mips@vger.kernel.org>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	linux-s390 <linux-s390@vger.kernel.org>,
	KVM list <kvm@vger.kernel.org>,
	linux-kselftest <linux-kselftest@vger.kernel.org>,
	Peter Foley <pefoley@google.com>,
	Shakeel Butt <shakeelb@google.com>,
	Ben Gardon <bgardon@google.com>
Subject: Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
Date: Thu, 26 Aug 2021 14:42:12 -0400 (EDT)	[thread overview]
Message-ID: <1700758714.29394.1630003332081.JavaMail.zimbra@efficios.com> (raw)
In-Reply-To: <YSblqrrpKcORzilX@google.com>

----- On Aug 25, 2021, at 8:51 PM, Sean Christopherson seanjc@google.com wrote:

> On Mon, Aug 23, 2021, Mathieu Desnoyers wrote:
>> [ re-send to Darren Hart ]
>> 
>> ----- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers
>> mathieu.desnoyers@efficios.com wrote:
>> 
>> > ----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc@google.com wrote:
>> > 
>> >> Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> >> migrated while the kernel is handling KVM_RUN.  This is a regression test
>> >> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> >> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> >> without updating rseq, leading to a stale CPU ID and other badness.
>> >> 
>> > 
>> > [...]
>> > 
>> > +#define RSEQ_SIG 0xdeadbeef
>> > 
>> > Is there any reason for defining a custom signature rather than including
>> > tools/testing/selftests/rseq/rseq.h ? This should take care of including
>> > the proper architecture header which will define the appropriate signature.
>> > 
>> > Arguably you don't define rseq critical sections in this test per se, but
>> > I'm wondering why the custom signature here.
> 
> Partly to avoid taking a dependency on rseq.h, and partly to try to call out
> that the test doesn't actually do any rseq critical sections.

It might be good to add a comment near this define stating this, so nobody
attempts to copy it as an example.
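
As a rough sketch (exact wording and placement up to you, this is only meant to
illustrate the warning I have in mind):

	/*
	 * Any value will do: this test does not define rseq critical
	 * sections, so the signature is never used to validate an abort
	 * handler.  Do NOT copy this into code that does use rseq critical
	 * sections; include tools/testing/selftests/rseq/rseq.h instead,
	 * which pulls in the proper per-architecture RSEQ_SIG.
	 */
	#define RSEQ_SIG 0xdeadbeef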

> 
>> > [...]
>> > 
>> >> +
>> >> +static void *migration_worker(void *ign)
>> >> +{
>> >> +	cpu_set_t allowed_mask;
>> >> +	int r, i, nr_cpus, cpu;
>> >> +
>> >> +	CPU_ZERO(&allowed_mask);
>> >> +
>> >> +	nr_cpus = CPU_COUNT(&possible_mask);
>> >> +
>> >> +	for (i = 0; i < 20000; i++) {
>> >> +		cpu = i % nr_cpus;
>> >> +		if (!CPU_ISSET(cpu, &possible_mask))
>> >> +			continue;
>> >> +
>> >> +		CPU_SET(cpu, &allowed_mask);
>> >> +
>> >> +		/*
>> >> +		 * Bump the sequence count twice to allow the reader to detect
>> >> +		 * that a migration may have occurred in between rseq and sched
>> >> +		 * CPU ID reads.  An odd sequence count indicates a migration
>> >> +		 * is in-progress, while a completely different count indicates
>> >> +		 * a migration occurred since the count was last read.
>> >> +		 */
>> >> +		atomic_inc(&seq_cnt);
>> > 
>> > So technically this atomic_inc contains the required barriers because the
>> > selftests implementation uses "__sync_add_and_fetch(&addr->val, 1)". But
>> > it's rather odd that the semantic differs from the kernel implementation in
>> > terms of memory barriers: the kernel implementation of atomic_inc
>> > guarantees no memory barriers, but this one happens to provide full
>> > barriers pretty much by accident (selftests futex/include/atomic.h
>> > documents no such guarantee).
> 
> Yeah, I got quite lost trying to figure out what atomics the test would actually
> end up with.

At the very least, until things are clarified in the selftests atomic header, I would
recommend adding a comment alongside each use of atomic_inc here stating which memory
barriers are required. I would even prefer that we add explicit (currently redundant)
write barriers so the requirement stays documented in the code. Performance of the
write-side does not matter much here.
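
Concretely, something along these lines in migration_worker() (a sketch only; I'm
assuming an smp_wmb()-style helper is available to the test, e.g. from
tools/include/asm/barrier.h, otherwise __sync_synchronize() would do):

		/*
		 * Flip the count to odd: a migration may be in progress.
		 * The reader must not observe the rseq/sched CPU IDs change
		 * while the count is even, so this increment needs at least
		 * release semantics.  The selftests atomic_inc() currently
		 * expands to __sync_add_and_fetch(), a full barrier, but add
		 * an explicit write barrier so the requirement is documented
		 * and survives future changes to the atomic helpers.
		 */
		atomic_inc(&seq_cnt);
		smp_wmb();

		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
			    errno, strerror(errno));

		smp_wmb();
		/* Back to even: the migration (if any) is complete. */
		atomic_inc(&seq_cnt);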

> 
>> > If this full barrier guarantee is indeed provided by the selftests atomic.h
>> > header, I would really like a comment stating that in the atomic.h header
>> > so the carpet is not pulled from under our feet by a future optimization.
>> > 
>> > 
>> >> +		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> >> +		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> >> +			    errno, strerror(errno));
>> >> +		atomic_inc(&seq_cnt);
>> >> +
>> >> +		CPU_CLR(cpu, &allowed_mask);
>> >> +
>> >> +		/*
>> >> +		 * Let the read-side get back into KVM_RUN to improve the odds
>> >> +		 * of task migration coinciding with KVM's run loop.
>> > 
>> > This comment should be about increasing the odds of letting the seqlock
>> > read-side complete. Otherwise, the delay between the two back-to-back
>> > atomic_inc is so small that the seqlock read-side may never have time to
>> > complete reading the rseq cpu id and the sched_getcpu() call, and can
>> > retry forever.
> 
> Hmm, but that's not why there's a delay.  I'm not arguing that a livelock isn't
> possible (though that syscall would have to be screaming fast),

I don't think we have the same understanding of the livelock scenario. AFAIU, the
livelock would be caused by a too-small delay between the second atomic_inc() of one
loop iteration and the first atomic_inc() of the next, compared to the time it takes
to perform a read-side critical section of the seqlock. Back-to-back atomic_inc can
be performed very quickly, so I doubt the sched_getcpu implementation has good odds
of being fast enough to complete in that narrow window, leading to lots of read-side
seqlock retries.
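
For reference, my mental model of the read-side is roughly the following (pseudocode;
helper and field names are approximations of what the test does, not exact):

	int rseq_cpu, cpu;
	unsigned int snapshot;

	do {
		/*
		 * Force the snapshot to an even value: if the writer was
		 * mid-update (odd count), the final comparison fails and we
		 * go around again.
		 */
		snapshot = atomic_read(&seq_cnt) & ~1;
		smp_rmb();

		rseq_cpu = READ_ONCE(__rseq_abi->cpu_id);
		cpu = sched_getcpu();	/* vDSO or syscall, relatively slow */

		smp_rmb();
	} while (snapshot != atomic_read(&seq_cnt));

	TEST_ASSERT(rseq_cpu == cpu, "rseq cpu id %d != sched cpu id %d",
		    rseq_cpu, cpu);

If the writer's even-count window is shorter than the time needed to execute the two
loads above, the while condition keeps failing and the reader spins on retries; the
usleep() is what widens that window.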

> but the primary
> motivation is very much to allow the read-side enough time to get back into KVM
> proper.

I'm puzzled by your statement. AFAIU, if we don't have the delay, then we end up
with something like:

migration thread                             KVM_RUN/read-side thread
-----------------------------------------------------------------------------------
                                             - ioctl(KVM_RUN)
- atomic_inc_seq_cst(&seqcnt)
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)
                                             - a = atomic_load(&seqcnt) & ~1
                                             - smp_rmb()
                                             - b = LOAD_ONCE(__rseq_abi->cpu_id);
                                             - sched_getcpu()
                                             - smp_rmb()
                                             - re-load seqcnt/compare (succeeds)
                                               - Can only succeed if entire read-side happens while the seqcnt
                                                 is in an even numbered state.
                                             - if (a != b) abort()
  /* no delay. Even counter state is very
     short. */
- atomic_inc_seq_cst(&seqcnt)
  /* Let's suppose the lack of delay causes the
     setaffinity to complete too early compared
     with KVM_RUN ioctl */
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)

  /* no delay. Even counter state is very
     short. */
- atomic_inc_seq_cst(&seqcnt)
  /* Then a setaffinity from a following
     migration thread loop will run
     concurrently with KVM_RUN */
                                             - ioctl(KVM_RUN)
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)

As pointed out here, if the first setaffinity runs too early compared with KVM_RUN,
a following setaffinity will run concurrently with it. However, the fact that the
even counter state is very short may very well hurt the progress of the seqlock
read-side.

> 
> To encounter the bug, TIF_NOTIFY_RESUME has to be recognized by KVM in its run
> loop, i.e. sched_setaffinity() must induce task migration after the read-side
> has invoked ioctl(KVM_RUN).

No argument here.

> 
>> > I'm wondering if 1 microsecond is sufficient on other architectures as
>> > well.
> 
> I'm definitely wondering that as well :-)
> 
>> > One alternative way to make this depend less on the architecture's
>> > implementation of sched_getcpu (whether it's a vDSO, or goes through a
>> > syscall) would be to read the rseq cpu id and call sched_getcpu a few times
>> > (e.g. 3 times) in the migration thread rather than use usleep, and throw
>> > away the value read. This would ensure the delay is appropriate on all
>> > architectures.
> 
> As above, I think an arbitrary delay is required regardless of how fast
> sched_getcpu() can execute.  One thought would be to do sched_getcpu() _and_
> usleep() to account for sched_getcpu() overhead and to satisfy the KVM_RUN part,
> but I don't know that that adds meaningful value.
> 
> The real test is if someone could see if the bug repros on non-x86 hardware...

As I explain in the scenario above, I don't think we agree on the reason why the
delay is required. One way to confirm this would be to run the code without the
delay and count how many seqcnt read-side snapshots succeed vs. how many need to
retry. We could then compare that with a run using a 1us delay between the migration
thread loop iterations. This data should help us come to a common understanding of
the delay's role.
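
Sticking with the read-side loop sketched earlier, the instrumentation would be
minimal (counter names hypothetical, printed once the migration thread is joined):

	/* Hypothetical counters; file scope so they can be printed at exit. */
	static unsigned long snaps_clean, snaps_retried;

	do {
		snapshot = atomic_read(&seq_cnt) & ~1;
		smp_rmb();
		rseq_cpu = READ_ONCE(__rseq_abi->cpu_id);
		cpu = sched_getcpu();
		smp_rmb();
		if (snapshot == atomic_read(&seq_cnt))
			break;
		snaps_retried++;	/* writer moved underneath us, retry */
	} while (1);
	snaps_clean++;

	/* ... once the migration thread has been joined: */
	printf("seqlock read-side: %lu clean snapshots, %lu retries\n",
	       snaps_clean, snaps_retried);

Running this once with the usleep() removed and once with the 1us delay should show
whether the retry count explodes without the delay, which is what my explanation
predicts.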

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
