[v3,07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback()
diff mbox series

Message ID 20190516085810.31077-8-rpenyaev@suse.de
State Superseded
Headers show
Series
  • epoll: support pollable epoll from userspace
Related show

Commit Message

Roman Penyaev May 16, 2019, 8:58 a.m. UTC
Each ep_poll_callback() is called when fd calls wakeup() on epfd.
So account new event in user ring.

The tricky part here is EPOLLONESHOT.  Since we are lockless we
have to be deal with ep_poll_callbacks() called in paralle, thus
use cmpxchg to clear public event bits and filter out concurrent
call from another cpu.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

Comments

Peter Zijlstra May 31, 2019, 9:56 a.m. UTC | #1
On Thu, May 16, 2019 at 10:58:04AM +0200, Roman Penyaev wrote:
> Each ep_poll_callback() is called when fd calls wakeup() on epfd.
> So account new event in user ring.
> 
> The tricky part here is EPOLLONESHOT.  Since we are lockless we
> have to be deal with ep_poll_callbacks() called in paralle, thus
> use cmpxchg to clear public event bits and filter out concurrent
> call from another cpu.
> 
> Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 2f551c005640..55612da9651e 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1407,6 +1407,29 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
>  }
>  #endif /* CONFIG_CHECKPOINT_RESTORE */
>  
> +/**
> + * Atomically clear public event bits and return %true if the old value has
> + * public event bits set.
> + */
> +static inline bool ep_clear_public_event_bits(struct epitem *epi)
> +{
> +	__poll_t old, flags;
> +
> +	/*
> +	 * Here we race with ourselves and with ep_modify(), which can
> +	 * change the event bits.  In order not to override events updated
> +	 * by ep_modify() we have to do cmpxchg.
> +	 */
> +
> +	old = epi->event.events;
> +	do {
> +		flags = old;
> +	} while ((old = cmpxchg(&epi->event.events, flags,
> +				flags & EP_PRIVATE_BITS)) != flags);
> +
> +	return flags & ~EP_PRIVATE_BITS;
> +}

AFAICT epi->event.events also has normal writes to it, eg. in
ep_modify(). A number of architectures cannot handle concurrent normal
writes and cmpxchg() to the same variable.
Roman Penyaev May 31, 2019, 11:22 a.m. UTC | #2
On 2019-05-31 11:56, Peter Zijlstra wrote:
> On Thu, May 16, 2019 at 10:58:04AM +0200, Roman Penyaev wrote:
>> Each ep_poll_callback() is called when fd calls wakeup() on epfd.
>> So account new event in user ring.
>> 
>> The tricky part here is EPOLLONESHOT.  Since we are lockless we
>> have to be deal with ep_poll_callbacks() called in paralle, thus
>> use cmpxchg to clear public event bits and filter out concurrent
>> call from another cpu.
>> 
>> Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> 
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index 2f551c005640..55612da9651e 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -1407,6 +1407,29 @@ struct file *get_epoll_tfile_raw_ptr(struct 
>> file *file, int tfd,
>>  }
>>  #endif /* CONFIG_CHECKPOINT_RESTORE */
>> 
>> +/**
>> + * Atomically clear public event bits and return %true if the old 
>> value has
>> + * public event bits set.
>> + */
>> +static inline bool ep_clear_public_event_bits(struct epitem *epi)
>> +{
>> +	__poll_t old, flags;
>> +
>> +	/*
>> +	 * Here we race with ourselves and with ep_modify(), which can
>> +	 * change the event bits.  In order not to override events updated
>> +	 * by ep_modify() we have to do cmpxchg.
>> +	 */
>> +
>> +	old = epi->event.events;
>> +	do {
>> +		flags = old;
>> +	} while ((old = cmpxchg(&epi->event.events, flags,
>> +				flags & EP_PRIVATE_BITS)) != flags);
>> +
>> +	return flags & ~EP_PRIVATE_BITS;
>> +}
> 
> AFAICT epi->event.events also has normal writes to it, eg. in
> ep_modify(). A number of architectures cannot handle concurrent normal
> writes and cmpxchg() to the same variable.

Yes, we race with the current function and with ep_modify().  Then, 
ep_modify()
should do something as the following:

-	epi->event.events = event->events
+	xchg(&epi->event.events, event->events);

Is that ok?

Just curious: what are these archs?

Thanks.

--
Roman
Peter Zijlstra May 31, 2019, 1:05 p.m. UTC | #3
On Fri, May 31, 2019 at 01:22:54PM +0200, Roman Penyaev wrote:
> On 2019-05-31 11:56, Peter Zijlstra wrote:
> > On Thu, May 16, 2019 at 10:58:04AM +0200, Roman Penyaev wrote:

> > > +static inline bool ep_clear_public_event_bits(struct epitem *epi)
> > > +{
> > > +	__poll_t old, flags;
> > > +
> > > +	/*
> > > +	 * Here we race with ourselves and with ep_modify(), which can
> > > +	 * change the event bits.  In order not to override events updated
> > > +	 * by ep_modify() we have to do cmpxchg.
> > > +	 */
> > > +
> > > +	old = epi->event.events;
> > > +	do {
> > > +		flags = old;
> > > +	} while ((old = cmpxchg(&epi->event.events, flags,
> > > +				flags & EP_PRIVATE_BITS)) != flags);
> > > +
> > > +	return flags & ~EP_PRIVATE_BITS;
> > > +}
> > 
> > AFAICT epi->event.events also has normal writes to it, eg. in
> > ep_modify(). A number of architectures cannot handle concurrent normal
> > writes and cmpxchg() to the same variable.
> 
> Yes, we race with the current function and with ep_modify().  Then,
> ep_modify()
> should do something as the following:
> 
> -	epi->event.events = event->events
> +	xchg(&epi->event.events, event->events);
> 
> Is that ok?

That should be correct, but at that point I think we should also always
read the thing with READ_ONCE() to avoid load-tearing. And I suspect it
then becomes sensible to change the type to atomic_t.

atomic_set() vs atomic_cmpxchg() only carries the extra overhead on
those 'dodgy' platforms.

> Just curious: what are these archs?

Oh, lovely stuff like parisc, sparc32 and arc-eznps. See
arch/parisc/lib/bitops.c:__cmpxchg_*() for example :/ Those systems only
have a single truly atomic op (something from the xchg / test-and-set
family) and the rest is fudged on top of that.
Roman Penyaev May 31, 2019, 3:05 p.m. UTC | #4
On 2019-05-31 15:05, Peter Zijlstra wrote:
> On Fri, May 31, 2019 at 01:22:54PM +0200, Roman Penyaev wrote:
>> On 2019-05-31 11:56, Peter Zijlstra wrote:
>> > On Thu, May 16, 2019 at 10:58:04AM +0200, Roman Penyaev wrote:
> 
>> > > +static inline bool ep_clear_public_event_bits(struct epitem *epi)
>> > > +{
>> > > +	__poll_t old, flags;
>> > > +
>> > > +	/*
>> > > +	 * Here we race with ourselves and with ep_modify(), which can
>> > > +	 * change the event bits.  In order not to override events updated
>> > > +	 * by ep_modify() we have to do cmpxchg.
>> > > +	 */
>> > > +
>> > > +	old = epi->event.events;
>> > > +	do {
>> > > +		flags = old;
>> > > +	} while ((old = cmpxchg(&epi->event.events, flags,
>> > > +				flags & EP_PRIVATE_BITS)) != flags);
>> > > +
>> > > +	return flags & ~EP_PRIVATE_BITS;
>> > > +}
>> >
>> > AFAICT epi->event.events also has normal writes to it, eg. in
>> > ep_modify(). A number of architectures cannot handle concurrent normal
>> > writes and cmpxchg() to the same variable.
>> 
>> Yes, we race with the current function and with ep_modify().  Then,
>> ep_modify()
>> should do something as the following:
>> 
>> -	epi->event.events = event->events
>> +	xchg(&epi->event.events, event->events);
>> 
>> Is that ok?
> 
> That should be correct, but at that point I think we should also always
> read the thing with READ_ONCE() to avoid load-tearing. And I suspect it
> then becomes sensible to change the type to atomic_t.

But it seems if we afraid of load tearing that should be fixed 
separately,
independently of this patchset, because epi->event.events is updated
in ep_modify() and races with ep_poll_callback(), which reads the value
in couple of places.

Probably nothing terrible will happen, because eventually event comes
or just ignored.


> atomic_set() vs atomic_cmpxchg() only carries the extra overhead on
> those 'dodgy' platforms.
> 
>> Just curious: what are these archs?
> 
> Oh, lovely stuff like parisc, sparc32 and arc-eznps. See
> arch/parisc/lib/bitops.c:__cmpxchg_*() for example :/ Those systems 
> only
> have a single truly atomic op (something from the xchg / test-and-set
> family) and the rest is fudged on top of that.

Locks, nice.

--
Roman

Patch
diff mbox series

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2f551c005640..55612da9651e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1407,6 +1407,29 @@  struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
 }
 #endif /* CONFIG_CHECKPOINT_RESTORE */
 
+/**
+ * Atomically clear public event bits and return %true if the old value has
+ * public event bits set.
+ */
+static inline bool ep_clear_public_event_bits(struct epitem *epi)
+{
+	__poll_t old, flags;
+
+	/*
+	 * Here we race with ourselves and with ep_modify(), which can
+	 * change the event bits.  In order not to override events updated
+	 * by ep_modify() we have to do cmpxchg.
+	 */
+
+	old = epi->event.events;
+	do {
+		flags = old;
+	} while ((old = cmpxchg(&epi->event.events, flags,
+				flags & EP_PRIVATE_BITS)) != flags);
+
+	return flags & ~EP_PRIVATE_BITS;
+}
+
 /**
  * Adds a new entry to the tail of the list in a lockless way, i.e.
  * multiple CPUs are allowed to call this function concurrently.
@@ -1526,6 +1549,20 @@  static int ep_poll_callback(struct epitem *epi, __poll_t pollflags)
 	if (pollflags && !(pollflags & epi->event.events))
 		goto out_unlock;
 
+	if (ep_polled_by_user(ep)) {
+		/*
+		 * For polled descriptor from user we have to disable events on
+		 * callback path in case of one-shot.
+		 */
+		if ((epi->event.events & EPOLLONESHOT) &&
+		    !ep_clear_public_event_bits(epi))
+			/* Race is lost, another callback has cleared events */
+			goto out_unlock;
+
+		ep_add_event_to_uring(epi, pollflags);
+		goto wakeup;
+	}
+
 	/*
 	 * If we are transferring events to userspace, we can hold no locks
 	 * (because we're accessing user memory, and because of linux f_op->poll()
@@ -1545,6 +1582,7 @@  static int ep_poll_callback(struct epitem *epi, __poll_t pollflags)
 		ep_pm_stay_awake_rcu(epi);
 	}
 
+wakeup:
 	/*
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.