* Re: 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal()
       [not found]               ` <a2f3f9ac-dac2-eadc-269e-91652d78ebd3@redhat.com>
@ 2021-07-18 12:42                 ` Hillf Danton
  2021-07-19 15:38                   ` Paolo Bonzini
  0 siblings, 1 reply; 13+ messages in thread
From: Hillf Danton @ 2021-07-18 12:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Michael S. Tsirkin,
	linux-mm, LKML, Al Viro

On Fri, 16 Jul 2021 13:55:27 +0200 Paolo Bonzini wrote:
>On 16/07/21 11:37, Hillf Danton wrote:
>> On Fri, 16 Jul 2021 09:59:15 +0200 Paolo Bonzini wrote:
>>> * the warning only occurs if preemption occurs during the
>>> spin_lock_irqsave critical section (and therefore it can only occur in
>>> PREEMPT_RT kernels)
>> 
>> With that lock held, no waitqueue entry can be added to the WQ - IOW no
>> wakeup will go astray.
>> 
>>> * the warning causes an early return of 0 that messes up the VM's networking
>> 
>> Is the mess-up due to the early return of 0, or to the missed wakeup?
>
>It's caused by the missing wakeup, i.e. eventfd_signal not really 
>signaling anything.

There are two write_seqcount_begin() call sites in x/virt/kvm/eventfd.c, and
the one in kvm_irqfd_deassign() is wrapped in spin_lock_irq(&kvm->irqfds.lock),
which also protects irqfd_update().

What isn't clear is whether the risk is zero that either write section can be
preempted by a seqcount reader. That risk could end in the livelock described
in x/Documentation/locking/seqlock.rst.

+A sequence counter write side critical section must never be preempted
+or interrupted by read side sections. Otherwise the reader will spin for
+the entire scheduler tick due to the odd sequence count value and the
+interrupted writer. If that reader belongs to a real-time scheduling
+class, it can spin forever and the kernel will livelock.



* Re: 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal()
  2021-07-18 12:42                 ` 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal() Hillf Danton
@ 2021-07-19 15:38                   ` Paolo Bonzini
  2021-07-21  7:04                     ` Hillf Danton
  0 siblings, 1 reply; 13+ messages in thread
From: Paolo Bonzini @ 2021-07-19 15:38 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Michael S. Tsirkin,
	linux-mm, LKML, Al Viro

On 18/07/21 14:42, Hillf Danton wrote:
>> It's caused by the missing wakeup, i.e. eventfd_signal not really
>> signaling anything.
> 
> Can you please point me to the waiters in the mainline?

It's irqfd_wakeup.

> There are two write_seqcount_begin() call sites in x/virt/kvm/eventfd.c, and
> the one in kvm_irqfd_deassign() is wrapped in spin_lock_irq(&kvm->irqfds.lock),
> which also protects irqfd_update().
> 
> What isn't clear is whether the risk is zero that either write section can be
> preempted by a seqcount reader. That risk could end in the livelock described
> in x/Documentation/locking/seqlock.rst.

Since the introduction of seqcount_spinlock_t, the writers automatically 
disable preemption.  This is definitely the right thing in this case 
where the seqcount writers are small enough, and the readers are hot 
enough, that using a local lock would be too heavyweight.

Without that, the livelock would be possible, though very unlikely.  In 
practice seqcount updates should only happen while the producer is 
quiescent; and also the seqcount readers and writers will often be 
pinned to separate CPUs.
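
For reference, the write/read pattern being described looks roughly like
this (a minimal sketch; the demo struct and its field names are made up
for illustration, not the actual kvm/irqfd code):

	#include <linux/types.h>
	#include <linux/seqlock.h>
	#include <linux/spinlock.h>

	struct demo {
		spinlock_t		lock;
		seqcount_spinlock_t	seq;	/* associated with ->lock */
		u64			a, b;
	};

	static void demo_init(struct demo *d)
	{
		spin_lock_init(&d->lock);
		seqcount_spinlock_init(&d->seq, &d->lock);
	}

	/* Writer: serialized by ->lock.  On PREEMPT_RT, where spinlock_t
	 * is preemptible, write_seqcount_begin() on a seqcount_spinlock_t
	 * disables preemption itself, so a reader cannot preempt the
	 * write section.
	 */
	static void demo_update(struct demo *d, u64 a, u64 b)
	{
		spin_lock(&d->lock);
		write_seqcount_begin(&d->seq);
		d->a = a;
		d->b = b;
		write_seqcount_end(&d->seq);
		spin_unlock(&d->lock);
	}

	/* Reader: lockless; retries if it overlapped a write section. */
	static u64 demo_read(struct demo *d)
	{
		unsigned int seq;
		u64 sum;

		do {
			seq = read_seqcount_begin(&d->seq);
			sum = d->a + d->b;
		} while (read_seqcount_retry(&d->seq, seq));

		return sum;
	}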

Paolo

> +A sequence counter write side critical section must never be preempted
> +or interrupted by read side sections. Otherwise the reader will spin for
> +the entire scheduler tick due to the odd sequence count value and the
> +interrupted writer. If that reader belongs to a real-time scheduling
> +class, it can spin forever and the kernel will livelock.
> 




* Re: 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal()
  2021-07-19 15:38                   ` Paolo Bonzini
@ 2021-07-21  7:04                     ` Hillf Danton
  2021-07-21  7:25                       ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Hillf Danton @ 2021-07-21  7:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Michael S. Tsirkin,
	linux-mm, LKML, Al Viro

On Mon, 19 Jul 2021 17:38:45 +0200 Paolo Bonzini wrote:
>On 18/07/21 14:42, Hillf Danton wrote:
>>> It's caused by the missing wakeup, i.e. eventfd_signal not really
>>> signaling anything.
>> 
>> Can you please point me to the waiters in the mainline?
>
>It's irqfd_wakeup.

Thanks for the pointer.

With PREEMPT_RT put aside, the race looks like the following.

	CPU0		CPU1		CPU2
	----		----		----
	lock waitqueue
	wake up waiters
	unlock waitqueue

			lock waitqueue
			no waiter
			unlock waitqueue

					lock waitqueue
					add waiter
					unlock waitqueue

If the missing wakeup on CPU1 is critical for the waiter added on CPU2, then
the size of the race window is what matters. The race window can be simulated
by giving up the wakeup whenever a trylock of the waitqueue lock fails, as
sketched below.
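
A hypothetical sketch of that simulation, modeled on the shape of
eventfd_signal() in fs/eventfd.c (eventfd_signal_trylock() is a made-up
name, and this is an experiment, not a proposed fix):

	static __u64 eventfd_signal_trylock(struct eventfd_ctx *ctx, __u64 n)
	{
		unsigned long flags;

		/* Give up the wakeup whenever the lock is contended, so
		 * the missed-wakeup window can be hit deterministically.
		 */
		if (!spin_trylock_irqsave(&ctx->wqh.lock, flags))
			return 0;	/* simulated missed wakeup */

		if (ULLONG_MAX - ctx->count < n)
			n = ULLONG_MAX - ctx->count;
		ctx->count += n;
		if (waitqueue_active(&ctx->wqh))
			wake_up_locked_poll(&ctx->wqh, EPOLLIN);
		spin_unlock_irqrestore(&ctx->wqh.lock, flags);

		return n;
	}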

With PREEMPT_RT before your patch, eventfd_wake_count prevents the preempting
waker from acquiring the waitqueue lock, which widens the race window because
the wakeup is then certain to be missed.

	CPU0		CPU1
	----		----
	lock waitqueue
	wake
	up  <-- preempt
	waiters
	unlock waitqueue

			lock waitqueue
			add waiter
			unlock waitqueue

With PREEMPT_RT after your patch, the race window shrinks back to the
!PREEMPT_RT case, because the preemption no longer occurs.

But the preempting waker cannot matter unless the waiter is somehow special.
Why would the waiter be special in the first place? Or is it not special at
all, and is this simply a race that has always existed?



* Re: 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal()
  2021-07-21  7:04                     ` Hillf Danton
@ 2021-07-21  7:25                       ` Thomas Gleixner
  2021-07-21 10:11                         ` Hillf Danton
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2021-07-21  7:25 UTC (permalink / raw)
  To: Hillf Danton, Paolo Bonzini
  Cc: Sebastian Andrzej Siewior, Michael S. Tsirkin, linux-mm, LKML, Al Viro

On Wed, Jul 21 2021 at 15:04, Hillf Danton wrote:
>
> But the preempting waker cannot matter unless the waiter is somehow special.
> Why would the waiter be special in the first place? Or is it not special at
> all, and is this simply a race that has always existed?

See the large comment in eventfd_poll().



* Re: 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal()
  2021-07-21  7:25                       ` Thomas Gleixner
@ 2021-07-21 10:11                         ` Hillf Danton
  2021-07-21 10:59                           ` Paolo Bonzini
  0 siblings, 1 reply; 13+ messages in thread
From: Hillf Danton @ 2021-07-21 10:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Hillf Danton, Paolo Bonzini, Sebastian Andrzej Siewior,
	Michael S. Tsirkin, linux-mm, LKML, Al Viro

On Wed, 21 Jul 2021 09:25:32 +0200 Thomas Gleixner wrote:
>On Wed, Jul 21 2021 at 15:04, Hillf Danton wrote:
>>
>> But the preempting waker cannot matter unless the waiter is somehow special.
>> Why would the waiter be special in the first place? Or is it not special at
>> all, and is this simply a race that has always existed?
>
>See the large comment in eventfd_poll().

Is it possible for a reader to make eventfd_poll() return 0?

The poll and write steps below are from the big comment above
READ_ONCE(ctx->count) in eventfd_poll(); the read steps in the middle are
mine.

	poll:	count = ctx->count (INVALID!)
	write:	lock ctx->qwh.lock
	write:	ctx->count += n
	write:	**waitqueue_active is false**
	write:	**no wake_up_locked_poll!**
	write:	unlock ctx->qwh.lock
	read:	lock ctx->qwh.lock
	read:	*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
	read:	ctx->count -= *cnt;
	read:	**waitqueue_active is false**
	read:	unlock ctx->qwh.lock
	poll:	lock ctx->wqh.lock (in poll_wait)
	poll:	__add_wait_queue
	poll:	unlock ctx->wqh.lock
	poll:	eventfd_poll returns 0



* Re: 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal()
  2021-07-21 10:11                         ` Hillf Danton
@ 2021-07-21 10:59                           ` Paolo Bonzini
  2021-07-22  5:58                             ` Hillf Danton
  2021-07-23  2:23                             ` Hillf Danton
  0 siblings, 2 replies; 13+ messages in thread
From: Paolo Bonzini @ 2021-07-21 10:59 UTC (permalink / raw)
  To: Hillf Danton, Thomas Gleixner
  Cc: Sebastian Andrzej Siewior, Michael S. Tsirkin, linux-mm, LKML, Al Viro

On 21/07/21 12:11, Hillf Danton wrote:
> On Wed, 21 Jul 2021 09:25:32 +0200 Thomas Gleixner wrote:
>> On Wed, Jul 21 2021 at 15:04, Hillf Danton wrote:
>>>
>>> But the preempting waker cannot matter unless the waiter is somehow special.
>>> Why would the waiter be special in the first place? Or is it not special at
>>> all, and is this simply a race that has always existed?
>>
>> See the large comment in eventfd_poll().
> 
> Is it possible for a reader to make eventfd_poll() return 0?
> 
> The poll and write steps below are from the big comment above
> READ_ONCE(ctx->count) in eventfd_poll(); the read steps in the middle are
> mine.
> 
> 	poll:	count = ctx->count (INVALID!)
> 	write:	lock ctx->qwh.lock
> 	write:	ctx->count += n
> 	write:	**waitqueue_active is false**
> 	write:	**no wake_up_locked_poll!**
> 	write:	unlock ctx->qwh.lock
> 	read:	lock ctx->qwh.lock
> 	read:	*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
> 	read:	ctx->count -= *cnt;
> 	read:	**waitqueue_active is false**
> 	read:	unlock ctx->qwh.lock
> 	poll:	lock ctx->wqh.lock (in poll_wait)
> 	poll:	__add_wait_queue
> 	poll:	unlock ctx->wqh.lock
> 	poll:	eventfd_poll returns 0
> 

No, it's simply impossible.  The same comment explains why: "count = 
ctx->count" cannot move above poll_wait's locking of ctx->wqh.lock.
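
The call chain is roughly (a simplified sketch, showing the select/poll
case; epoll's queueing callback takes the same lock):

	poll_wait(file, &ctx->wqh, wait);
	  -> __pollwait()				/* fs/select.c */
	     -> add_wait_queue(&ctx->wqh, ...)
	        -> spin_lock_irqsave(&wq_head->lock, flags);  /* ACQUIRE */
	           __add_wait_queue(wq_head, wq_entry);
	           spin_unlock_irqrestore(&wq_head->lock, flags);

	count = READ_ONCE(ctx->count);	/* cannot be reordered before the
					 * ACQUIRE above */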

Paolo




* Re: 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal()
  2021-07-21 10:59                           ` Paolo Bonzini
@ 2021-07-22  5:58                             ` Hillf Danton
  2021-07-23  2:23                             ` Hillf Danton
  1 sibling, 0 replies; 13+ messages in thread
From: Hillf Danton @ 2021-07-22  5:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Hillf Danton, Thomas Gleixner, Sebastian Andrzej Siewior,
	Michael S. Tsirkin, linux-mm, LKML, Al Viro

On Wed, 21 Jul 2021 12:59:39 +0200 Paolo Bonzini wrote:
>On 21/07/21 12:11, Hillf Danton wrote:
>> On Wed, 21 Jul 2021 09:25:32 +0200 Thomas Gleixner wrote:
>>> On Wed, Jul 21 2021 at 15:04, Hillf Danton wrote:
>>>>
>>>> But the preempting waker cannot matter unless the waiter is somehow special.
>>>> Why would the waiter be special in the first place? Or is it not special at
>>>> all, and is this simply a race that has always existed?
>>>
>>> See the large comment in eventfd_poll().
>> 
>> Is it possible for a reader to make eventfd_poll() return 0?
>> 
>> The poll and write steps below are from the big comment above
>> READ_ONCE(ctx->count) in eventfd_poll(); the read steps in the middle are
>> mine.
>> 
>> 	poll:	count = ctx->count (INVALID!)
>> 	write:	lock ctx->qwh.lock
>> 	write:	ctx->count += n
>> 	write:	**waitqueue_active is false**
>> 	write:	**no wake_up_locked_poll!**
>> 	write:	unlock ctx->qwh.lock
>> 	read:	lock ctx->qwh.lock
>> 	read:	*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
>> 	read:	ctx->count -= *cnt;
>> 	read:	**waitqueue_active is false**
>> 	read:	unlock ctx->qwh.lock
>> 	poll:	lock ctx->wqh.lock (in poll_wait)
>> 	poll:	__add_wait_queue
>> 	poll:	unlock ctx->wqh.lock
>> 	poll:	eventfd_poll returns 0
>> 
>
>No, it's simply impossible.  The same comment explains why: "count = 
>ctx->count" cannot move above poll_wait's locking of ctx->wqh.lock.

Ah good catch.

Hillf



* Re: 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal()
  2021-07-21 10:59                           ` Paolo Bonzini
  2021-07-22  5:58                             ` Hillf Danton
@ 2021-07-23  2:23                             ` Hillf Danton
  2021-07-23  7:59                               ` Paolo Bonzini
  1 sibling, 1 reply; 13+ messages in thread
From: Hillf Danton @ 2021-07-23  2:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Hillf Danton, Thomas Gleixner, Sebastian Andrzej Siewior,
	Michael S. Tsirkin, linux-mm, LKML, Al Viro

On Wed, 21 Jul 2021 12:59:39 +0200 Paolo Bonzini wrote:
>On 21/07/21 12:11, Hillf Danton wrote:
>> On Wed, 21 Jul 2021 09:25:32 +0200 Thomas Gleixner wrote:
>>> On Wed, Jul 21 2021 at 15:04, Hillf Danton wrote:
>>>>
>>>> But the preempting waker cannot matter unless the waiter is somehow special.
>>>> Why would the waiter be special in the first place? Or is it not special at
>>>> all, and is this simply a race that has always existed?
>>>
>>> See the large comment in eventfd_poll().
>> 
>> Is it possible for a reader to make eventfd_poll() return 0?
>> 
>> The poll and write steps below are from the big comment above
>> READ_ONCE(ctx->count) in eventfd_poll(); the read steps in the middle are
>> mine.
>> 
>> 	poll:	count = ctx->count (INVALID!)
>> 	write:	lock ctx->qwh.lock
>> 	write:	ctx->count += n
>> 	write:	**waitqueue_active is false**
>> 	write:	**no wake_up_locked_poll!**
>> 	write:	unlock ctx->qwh.lock
>> 	read:	lock ctx->qwh.lock
>> 	read:	*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
>> 	read:	ctx->count -= *cnt;
>> 	read:	**waitqueue_active is false**
>> 	read:	unlock ctx->qwh.lock
>> 	poll:	lock ctx->wqh.lock (in poll_wait)
>> 	poll:	__add_wait_queue
>> 	poll:	unlock ctx->wqh.lock
>> 	poll:	eventfd_poll returns 0
>> 
>
>No, it's simply impossible.  The same comment explains why: "count = 
>ctx->count" cannot move above poll_wait's locking of ctx->wqh.lock.

Detect a concurrent reader or writer by reading the event counter before and
after poll_wait(), and compute the poll result with the case of an unstable
counter taken into account.

Cut the big comment, as the added barriers speak for themselves.

+++ x/fs/eventfd.c
@@ -131,49 +131,20 @@ static __poll_t eventfd_poll(struct file
 {
 	struct eventfd_ctx *ctx = file->private_data;
 	__poll_t events = 0;
-	u64 count;
+	u64 c0, count;
+
+	c0 = ctx->count;
+	smp_rmb();
 
 	poll_wait(file, &ctx->wqh, wait);
 
-	/*
-	 * All writes to ctx->count occur within ctx->wqh.lock.  This read
-	 * can be done outside ctx->wqh.lock because we know that poll_wait
-	 * takes that lock (through add_wait_queue) if our caller will sleep.
-	 *
-	 * The read _can_ therefore seep into add_wait_queue's critical
-	 * section, but cannot move above it!  add_wait_queue's spin_lock acts
-	 * as an acquire barrier and ensures that the read be ordered properly
-	 * against the writes.  The following CAN happen and is safe:
-	 *
-	 *     poll                               write
-	 *     -----------------                  ------------
-	 *     lock ctx->wqh.lock (in poll_wait)
-	 *     count = ctx->count
-	 *     __add_wait_queue
-	 *     unlock ctx->wqh.lock
-	 *                                        lock ctx->qwh.lock
-	 *                                        ctx->count += n
-	 *                                        if (waitqueue_active)
-	 *                                          wake_up_locked_poll
-	 *                                        unlock ctx->qwh.lock
-	 *     eventfd_poll returns 0
-	 *
-	 * but the following, which would miss a wakeup, cannot happen:
-	 *
-	 *     poll                               write
-	 *     -----------------                  ------------
-	 *     count = ctx->count (INVALID!)
-	 *                                        lock ctx->qwh.lock
-	 *                                        ctx->count += n
-	 *                                        **waitqueue_active is false**
-	 *                                        **no wake_up_locked_poll!**
-	 *                                        unlock ctx->qwh.lock
-	 *     lock ctx->wqh.lock (in poll_wait)
-	 *     __add_wait_queue
-	 *     unlock ctx->wqh.lock
-	 *     eventfd_poll returns 0
-	 */
-	count = READ_ONCE(ctx->count);
+	smp_rmb();
+	count = ctx->count;
+
+	if (c0 < count)
+		return EPOLLIN;
+	if (c0 > count)
+		return EPOLLOUT;
 
 	if (count > 0)
 		events |= EPOLLIN;



* Re: 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal()
  2021-07-23  2:23                             ` Hillf Danton
@ 2021-07-23  7:59                               ` Paolo Bonzini
  2021-07-23  9:48                                 ` Hillf Danton
  0 siblings, 1 reply; 13+ messages in thread
From: Paolo Bonzini @ 2021-07-23  7:59 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Michael S. Tsirkin,
	linux-mm, LKML, Al Viro

On 23/07/21 04:23, Hillf Danton wrote:
> Detect a concurrent reader or writer by reading the event counter before and
> after poll_wait(), and compute the poll result with the case of an unstable
> counter taken into account.
> 
> Cut the big comment, as the added barriers speak for themselves.

First and foremost, I'm not sure what you are trying to fix.

Second, the patch is wrong even without taking into account the lockless
accesses, because the condition for returning EPOLLOUT is certainly wrong.

Third, barriers very rarely speak for themselves.  In particular, what
do they pair with?  It seems to me that you are basically reintroducing
the same mistake that commit a484c3dd9426 ("eventfd: document lockless
access in eventfd_poll", 2016-03-22) fixed, at the time when the big
comment was introduced:

     Things aren't as simple as the read barrier in eventfd_poll
     would suggest.  In fact, the read barrier, besides lacking a
     comment, is not paired in any obvious manner with another read
     barrier, and it is pointless because it is sitting between a write
     (deep in poll_wait) and the read of ctx->count.

Paolo


> +++ x/fs/eventfd.c
> @@ -131,49 +131,20 @@ static __poll_t eventfd_poll(struct file
>   {
>   	struct eventfd_ctx *ctx = file->private_data;
>   	__poll_t events = 0;
> -	u64 count;
> +	u64 c0, count;
> +
> +	c0 = ctx->count;
> +	smp_rmb();
>   
>   	poll_wait(file, &ctx->wqh, wait);
>   
> -	/*
> -	 * All writes to ctx->count occur within ctx->wqh.lock.  This read
> -	 * can be done outside ctx->wqh.lock because we know that poll_wait
> -	 * takes that lock (through add_wait_queue) if our caller will sleep.
> -	 *
> -	 * The read _can_ therefore seep into add_wait_queue's critical
> -	 * section, but cannot move above it!  add_wait_queue's spin_lock acts
> -	 * as an acquire barrier and ensures that the read be ordered properly
> -	 * against the writes.  The following CAN happen and is safe:
> -	 *
> -	 *     poll                               write
> -	 *     -----------------                  ------------
> -	 *     lock ctx->wqh.lock (in poll_wait)
> -	 *     count = ctx->count
> -	 *     __add_wait_queue
> -	 *     unlock ctx->wqh.lock
> -	 *                                        lock ctx->qwh.lock
> -	 *                                        ctx->count += n
> -	 *                                        if (waitqueue_active)
> -	 *                                          wake_up_locked_poll
> -	 *                                        unlock ctx->qwh.lock
> -	 *     eventfd_poll returns 0
> -	 *
> -	 * but the following, which would miss a wakeup, cannot happen:
> -	 *
> -	 *     poll                               write
> -	 *     -----------------                  ------------
> -	 *     count = ctx->count (INVALID!)
> -	 *                                        lock ctx->qwh.lock
> -	 *                                        ctx->count += n
> -	 *                                        **waitqueue_active is false**
> -	 *                                        **no wake_up_locked_poll!**
> -	 *                                        unlock ctx->qwh.lock
> -	 *     lock ctx->wqh.lock (in poll_wait)
> -	 *     __add_wait_queue
> -	 *     unlock ctx->wqh.lock
> -	 *     eventfd_poll returns 0
> -	 */
> -	count = READ_ONCE(ctx->count);
> +	smp_rmb();
> +	count = ctx->count;
> +
> +	if (c0 < count)
> +		return EPOLLIN;
> +	if (c0 > count)
> +		return EPOLLOUT;
>   
>   	if (count > 0)
>   		events |= EPOLLIN;
> 




* Re: 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal()
  2021-07-23  7:59                               ` Paolo Bonzini
@ 2021-07-23  9:48                                 ` Hillf Danton
  2021-07-23 10:56                                   ` Paolo Bonzini
  0 siblings, 1 reply; 13+ messages in thread
From: Hillf Danton @ 2021-07-23  9:48 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Michael S. Tsirkin,
	linux-mm, LKML, Al Viro

On Fri, 23 Jul 2021 09:59:55 +0200  Paolo Bonzini wrote:
>On 23/07/21 04:23, Hillf Danton wrote:
>> Detect a concurrent reader or writer by reading the event counter before and
>> after poll_wait(), and compute the poll result with the case of an unstable
>> counter taken into account.
>> 
>> Cut the big comment, as the added barriers speak for themselves.
>
>First and foremost, I'm not sure what you are trying to fix.

The proposed change builds the return value without assuming that the event
count is stable across poll_wait(). If it is unstable, then we know there is
a concurrent reader and/or writer, both of which are currently ignored.

>
>Second, the patch is wrong even without taking into account the lockless
>accesses, because the condition for returning EPOLLOUT is certainly wrong.

Given that the event count is detected to have been consumed, there is room
in the counter, racy as this is, for a writer to make some progress.

>
>Third, barriers very rarely speak for themselves.  In particular, what
>do they pair with?

Frankly, there is no need to consider pairing. The barriers are simply
re-added to remove the "seep" described in the comment; the comment then
goes away along with the seep.

What the comment does not cover is the case of a race among more than two
parties.

Hillf

>It seems to me that you are basically reintroducing
>the same mistake that commit a484c3dd9426 ("eventfd: document lockless
>access in eventfd_poll", 2016-03-22) fixed, at the time when the big
>comment was introduced:
>
>     Things aren't as simple as the read barrier in eventfd_poll
>     would suggest.  In fact, the read barrier, besides lacking a
>     comment, is not paired in any obvious manner with another read
>     barrier, and it is pointless because it is sitting between a write
>     (deep in poll_wait) and the read of ctx->count.
>
>Paolo
>
>
>> +++ x/fs/eventfd.c
>> @@ -131,49 +131,20 @@ static __poll_t eventfd_poll(struct file
>>   {
>>   	struct eventfd_ctx *ctx = file->private_data;
>>   	__poll_t events = 0;
>> -	u64 count;
>> +	u64 c0, count;
>> +
>> +	c0 = ctx->count;
>> +	smp_rmb();
>>   
>>   	poll_wait(file, &ctx->wqh, wait);
>>   
>> -	/*
>> -	 * All writes to ctx->count occur within ctx->wqh.lock.  This read
>> -	 * can be done outside ctx->wqh.lock because we know that poll_wait
>> -	 * takes that lock (through add_wait_queue) if our caller will sleep.
>> -	 *
>> -	 * The read _can_ therefore seep into add_wait_queue's critical
>> -	 * section, but cannot move above it!  add_wait_queue's spin_lock acts
>> -	 * as an acquire barrier and ensures that the read be ordered properly
>> -	 * against the writes.  The following CAN happen and is safe:
>> -	 *
>> -	 *     poll                               write
>> -	 *     -----------------                  ------------
>> -	 *     lock ctx->wqh.lock (in poll_wait)
>> -	 *     count = ctx->count
>> -	 *     __add_wait_queue
>> -	 *     unlock ctx->wqh.lock
>> -	 *                                        lock ctx->qwh.lock
>> -	 *                                        ctx->count += n
>> -	 *                                        if (waitqueue_active)
>> -	 *                                          wake_up_locked_poll
>> -	 *                                        unlock ctx->qwh.lock
>> -	 *     eventfd_poll returns 0
>> -	 *
>> -	 * but the following, which would miss a wakeup, cannot happen:
>> -	 *
>> -	 *     poll                               write
>> -	 *     -----------------                  ------------
>> -	 *     count = ctx->count (INVALID!)
>> -	 *                                        lock ctx->qwh.lock
>> -	 *                                        ctx->count += n
>> -	 *                                        **waitqueue_active is false**
>> -	 *                                        **no wake_up_locked_poll!**
>> -	 *                                        unlock ctx->qwh.lock
>> -	 *     lock ctx->wqh.lock (in poll_wait)
>> -	 *     __add_wait_queue
>> -	 *     unlock ctx->wqh.lock
>> -	 *     eventfd_poll returns 0
>> -	 */
>> -	count = READ_ONCE(ctx->count);
>> +	smp_rmb();
>> +	count = ctx->count;
>> +
>> +	if (c0 < count)
>> +		return EPOLLIN;
>> +	if (c0 > count)
>> +		return EPOLLOUT;
>>   
>>   	if (count > 0)
>>   		events |= EPOLLIN;
>> 



* Re: 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal()
  2021-07-23  9:48                                 ` Hillf Danton
@ 2021-07-23 10:56                                   ` Paolo Bonzini
  2021-07-24  4:33                                     ` Hillf Danton
  0 siblings, 1 reply; 13+ messages in thread
From: Paolo Bonzini @ 2021-07-23 10:56 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Michael S. Tsirkin,
	linux-mm, LKML, Al Viro

On 23/07/21 11:48, Hillf Danton wrote:
> On Fri, 23 Jul 2021 09:59:55 +0200  Paolo Bonzini wrote:
>> First and foremost, I'm not sure what you are trying to fix.
> 
> The proposed change builds the return value without assuming that the event
> count is stable across poll_wait(). If it is unstable, then we know there is
> a concurrent reader and/or writer, both of which are currently ignored.

Concurrent reads or writes have their own wake_up_locked_poll calls. 
Why are they not enough?
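
For reference, a simplified sketch of those paths as they appear in
fs/eventfd.c around v5.13 (fragments only; blocking, error handling and
the surrounding declarations such as ucnt are omitted):

	/* eventfd_write(): the producer wakes pollers waiting for EPOLLIN. */
	spin_lock_irq(&ctx->wqh.lock);
	if (ULLONG_MAX - ctx->count > ucnt) {
		ctx->count += ucnt;
		if (waitqueue_active(&ctx->wqh))
			wake_up_locked_poll(&ctx->wqh, EPOLLIN);
	}
	spin_unlock_irq(&ctx->wqh.lock);

	/* eventfd_read(): the consumer wakes writers waiting for EPOLLOUT. */
	spin_lock_irq(&ctx->wqh.lock);
	if (ctx->count > 0) {
		eventfd_ctx_do_read(ctx, &ucnt);
		if (waitqueue_active(&ctx->wqh))
			wake_up_locked_poll(&ctx->wqh, EPOLLOUT);
	}
	spin_unlock_irq(&ctx->wqh.lock);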

>> Second, the patch is wrong even without taking into account the lockless
>> accesses, because the condition for returning EPOLLOUT is certainly wrong.
> 
> Given that the event count is detected to have been consumed, there is room
> in the counter, racy as this is, for a writer to make some progress.

For one, you do not return EPOLLIN|EPOLLOUT correctly.

>> Third, barriers very rarely speak for themselves.  In particular, what
>> do they pair with?
> 
> Frankly, there is no need to consider pairing. The barriers are simply
> re-added to remove the "seep" described in the comment; the comment then
> goes away along with the seep.

Then you would need an smp_mb() to order a spin_unlock() against a
READ_ONCE().  But adding memory barriers randomly is the worst way to
write a lockless algorithm that you can explain to others, and "there is
no need to consider pairing" all but convinces me that you haven't put
enough thought into the change you're proposing.

(Shameless plug: https://lwn.net/Articles/844224/).
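
Concretely, why the proposed smp_rmb() buys nothing here (a sketch; the
comments are the argument):

	poll_wait(file, &ctx->wqh, wait);	/* ends with stores under the
						 * lock, then spin_unlock */
	smp_rmb();				/* orders loads against loads
						 * only: it cannot order the
						 * stores above against the
						 * load below */
	count = ctx->count;			/* load */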

> What the comment does not cover is the case of a race among more than two
> parties.

But you still haven't explained what the bug is.

Paolo




* Re: 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal()
  2021-07-23 10:56                                   ` Paolo Bonzini
@ 2021-07-24  4:33                                     ` Hillf Danton
  2021-07-26 11:03                                       ` Paolo Bonzini
  0 siblings, 1 reply; 13+ messages in thread
From: Hillf Danton @ 2021-07-24  4:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Hillf Danton, Thomas Gleixner, Sebastian Andrzej Siewior,
	Michael S. Tsirkin, linux-mm, LKML, Al Viro

On Fri, 23 Jul 2021 12:56:15 +0200 Paolo Bonzini wrote:
>On 23/07/21 11:48, Hillf Danton wrote:
>> What the comment does not cover is the case of a race among more than two
>> parties.
>
>But you still haven't explained what the bug is.

I am inclined to call it a race, one that was there both before and after
the big comment was added, rather than a bug.


CPU1		CPU2		CPU3		CPU4
----		----		----		----
		lock WQ
		count += n
		no waiter
		unlock WQ

------------------------------- c0 = count

				lock WQ
------------------------------- c2 = count
				add waiter for EPOLLIN
				unlock WQ

						lock WQ
						count = 0
						wakeup EPOLLOUT
						unlock WQ

lock WQ
count += n
no waiter
unlock WQ
------------------------------- c1 = count


The c0 and c1 reads in the above chart are what I proposed, while c2 falls
within the range covered by the "seep" in the comment,

	 * The read _can_ therefore seep into add_wait_queue's critical
	 * section, but cannot move above it!  add_wait_queue's spin_lock acts
	 * as an acquire barrier and ensures that the read be ordered properly
	 * against the writes.

and is likely different from c1.



* Re: 5.13-rt1 + KVM = WARNING: at fs/eventfd.c:74 eventfd_signal()
  2021-07-24  4:33                                     ` Hillf Danton
@ 2021-07-26 11:03                                       ` Paolo Bonzini
  0 siblings, 0 replies; 13+ messages in thread
From: Paolo Bonzini @ 2021-07-26 11:03 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Michael S. Tsirkin,
	linux-mm, LKML, Al Viro

On 24/07/21 06:33, Hillf Danton wrote:
> 		lock WQ
> 		count += n
> 		no waiter
> 		unlock WQ

Ok, this is a write.

> 
> 				lock WQ
> 				add waiter for EPOLLIN
> 				unlock WQ

This is eventfd_poll().  It hasn't yet returned EPOLLIN.

> 						lock WQ
> 						count = 0
> 						wakeup EPOLLOUT
> 						unlock WQ

This is a read().

> lock WQ
> count += n
> no waiter
> unlock WQ

This is wrong; after "unlock WQ" on CPU3 there *is* a waiter, and no one has
woken it up yet.

Paolo

> ------------------------------- c1 = count



