* [PATCH v3 00/13] epoll: support pollable epoll from userspace
@ 2019-05-16  8:57 Roman Penyaev
  2019-05-16  8:57 ` [PATCH v3 01/13] epoll: move private helpers from a header to the source Roman Penyaev
                   ` (15 more replies)
  0 siblings, 16 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:57 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

Hi all,

This is v3 which introduces pollable epoll from userspace.

v3:
 - Measurements made, presented below.

 - Fix alignment of the epoll_uitem structure on all 64-bit archs except
   x86-64. epoll_uitem should always be 16 bytes, a proper BUILD_BUG_ON
   is added. (Linus)

 - Check pollflags explicitly for 0 inside the work callback, and do nothing
   if 0.

v2:
 - No reallocations, the max number of items (thus size of the user ring)
   is specified by the caller.

 - Interface is simplified: -ENOSPC is returned on attempt to add a new
   epoll item if the max number has been reached, nothing more.

 - Allocated pages are accounted using user->locked_vm and limited to
   RLIMIT_MEMLOCK value.

 - EPOLLONESHOT is handled.

This series introduces pollable epoll from userspace, i.e. the user creates
an epfd with the new EPOLL_USERPOLL flag, mmaps the epoll descriptor, gets
the header and ring pointers and then consumes ready events from the ring,
avoiding the epoll_wait() call.  When the ring is empty, the user has to
call epoll_wait() in order to wait for new events.  epoll_wait() returns
-ESTALE if the user ring still has events (kind of an indication that the
user has to consume events from the user ring first; I could not invent
anything better than returning -ESTALE).
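
Roughly, the resulting event loop looks as follows (a minimal sketch; the
uepoll_consume() helper, the evs buffer and the epfd/ring setup are
hypothetical and not part of this series):

 for (;;) {
     /* Drain the shared ring first */
     nr = uepoll_consume(ring, evs, max_evs);
     if (nr > 0) {
         handle_events(evs, nr);
         continue;
     }

     /* Ring is empty: wait in the kernel (events == NULL, maxevents == 0) */
     rc = epoll_wait(epfd, NULL, 0, -1);
     if (rc < 0 && errno == ESTALE)
         /* New events appeared in the ring meanwhile, consume them */
         continue;
     if (rc < 0 && errno != EINTR)
         break;
 }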

For the user header and user ring allocation I used vmalloc_user().  I found
that it is much easier to reuse remap_vmalloc_range_partial() instead of
dealing with the page cache (like aio.c does).  What is also nice is that
the virtual address is properly aligned on SHMLBA, thus there should not be
any d-cache aliasing problems on archs with VIVT or VIPT caches.
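
Just to illustrate the idea, the ->mmap() handler can then boil down to
something like the sketch below (this is only an assumed shape, not the
actual code of patch 12; only the header mapping is shown and flag/bounds
checks are mostly omitted):

 static int ep_eventpoll_mmap(struct file *file, struct vm_area_struct *vma)
 {
     struct eventpoll *ep = file->private_data;
     size_t size = vma->vm_end - vma->vm_start;

     /* Sketch: map only the header + items area */
     if (size > ep->header_length)
         return -ENXIO;

     return remap_vmalloc_range_partial(vma, vma->vm_start,
                                        ep->user_header, size);
 }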

** Measurements

In order to measure polling from userspace libevent was modified [1] and
bench_http benchmark (client and server) was used:

 o EPOLLET, original epoll:

    20000 requests in 0.551306 sec. (36277.49 throughput)
    Each took about 5.54 msec latency
    1600000bytes read. 0 errors.

 o EPOLLET + polling from userspace:

    20000 requests in 0.475585 sec. (42053.47 throughput)
    Each took about 4.78 msec latency
    1600000bytes read. 0 errors.

So harvesting events from userspace gives a 15% gain.  bench_http is not an
ideal benchmark, but at least it is part of libevent and was easy to modify.

Worth mentioning that uepoll is very sensitive to the CPU, e.g. the gain above
is observed on a desktop "Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz", but on
an "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz" measurements are almost the
same for both runs.

** Limitations
    
1. Always expect the EPOLLET flag for new epoll items (Edge Triggered behavior)
     obviously we can't call vfs_poll() from userspace to have level
     triggered behaviour.
    
2. No support for EPOLLWAKEUP
     events are consumed from userspace, thus no way to call __pm_relax()
    
3. No support for EPOLLEXCLUSIVE
     If device does not pass pollflags to wake_up() there is no way to
     call poll() from the context under spinlock, thus special work is
     scheduled to offload polling.  In this specific case we can't
     support exclusive wakeups, because we do not know actual result
     of scheduled work and have to wake up every waiter.

** Principle of operation

* Basic structures shared with userspace:

In order to consume events from userspace all inserted items should be
stored in an items array, where each item carries the original epoll_event
fields and a u32 field for keeping ready events, i.e. each item has the
following struct:

 struct epoll_uitem {
    __poll_t ready_events;
    __poll_t events;
    __u64 data;
 };
 BUILD_BUG_ON(sizeof(struct epoll_uitem) != 16);

And the following is a header, which is seen by userspace:

 struct epoll_uheader {
    u32 magic;          /* epoll user header magic */
    u32 header_length;  /* length of the header + items */
    u32 index_length;   /* length of the index ring, always pow2 */
    u32 max_items_nr;   /* max num of items */
    u32 head;           /* updated by userland */
    u32 tail;           /* updated by kernel */

	struct epoll_uitem items[] __aligned(128);
 };

 /* Header is 128 bytes, thus items are aligned on CPU cache */
 BUILD_BUG_ON(sizeof(struct epoll_uheader) != 128);

In order to poll the epfd from userspace the application has to call:

   epoll_create2(EPOLL_USERPOLL, max_items_nr);

Ready events are kept in a ring buffer, which is simply an index table,
where each element points to an item in a header:

 unsigned int *user_index;
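
A rough sketch of the userspace setup might look as follows (the exact mmap
offsets/lengths below are an assumption for illustration; see the test app
[2] for the real usage):

 struct epoll_uheader *header;
 unsigned int *index;
 size_t len;
 int epfd;

 epfd = epoll_create2(EPOLL_USERPOLL, max_items_nr);

 /* Map one page of the header first to learn the actual lengths */
 header = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, epfd, 0);
 assert(header != MAP_FAILED);
 assert(header->magic == EPOLL_USERPOLL_HEADER_MAGIC);

 /* Remap the whole area: header + items followed by the index ring */
 len = header->header_length + header->index_length;
 munmap(header, 4096);
 header = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, epfd, 0);
 index  = (unsigned int *)((char *)header + header->header_length);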

* How is a new event accounted on the kernel side?  How is it consumed from
* userspace?

When a new event comes for some epoll item the kernel does the following:

 struct epoll_uitem *uitem;

 /* Each item has a bit (index in the user items array), discussed later */
 uitem = &user_header->items[epi->bit];

 if (!atomic_fetch_or(&uitem->ready_events, pollflags)) {
     i = atomic_fetch_add(&ep->user_header->tail, 1);

     item_idx = &user_index[i & index_mask];

     /* Signal with the item bit, user spins on index expecting value > 0 */
     *item_idx = epi->bit + 1;

     /*
      * Want the index update to be flushed from the CPU write buffer and
      * immediately visible on the userspace side to avoid long busy
      * loops.
      */
     smp_wmb();
 }

The important thing here is that the ring can't grow infinitely and corrupt
other elements, because the kernel adds a new index entry only when the item
transitions from "not ready" to "ready", so userspace has to clear the
ready_events field before a new entry for that item can appear.

On the user side the following code should be used in order to consume
events:

 tail = READ_ONCE(header->tail);
 for (; header->head != tail; header->head++) {
     item_idx_ptr = &index[header->head & index_mask];

     /*
      * Spin here till we see a valid index
      */
     while (!(idx = __atomic_load_n(item_idx_ptr, __ATOMIC_ACQUIRE)))
         ;

     item = &header->items[idx - 1];

     /*
      * Mark the index as invalid.  That is for userspace only, the kernel
      * does not care and will refill this pointer only when it observes
      * that the event is cleared, which happens below.
      */
     *item_idx_ptr = 0;

     /*
      * Fetch the data first; if the event has been cleared by the kernel
      * (EPOLLREMOVED is set), the data is dropped and the entry is skipped.
      */
     event->data = item->data;
     event->events = __atomic_exchange_n(&item->ready_events, 0,
                         __ATOMIC_RELEASE);
 }

* How does a new epoll item get its index inside the user items array?

The kernel has a bitmap for that and takes a free bit on an attempt to insert
a new epoll item.  When the bitmap is full -ENOSPC is returned.
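
From the user side this simply shows up as epoll_ctl() failing (a minimal
illustration, other error handling elided):

 struct epoll_event ev = {
     .events  = EPOLLIN | EPOLLET,   /* EPOLLET is mandatory for uepoll */
     .data.fd = fd,
 };
 if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0 && errno == ENOSPC) {
     /* max_items_nr items are already inserted, nothing more can be added */
 }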

* Is there any testing app available?

There is a small app [2] which starts many threads with many event fds and
produces many events, while a single consumer fetches them from userspace
and goes to the kernel from time to time in order to wait.

[1] https://github.com/libevent/libevent/pull/801
[2] https://github.com/rouming/test-tools/blob/master/userpolled-epoll.c


Roman Penyaev (13):
  epoll: move private helpers from a header to the source
  epoll: introduce user structures for polling from userspace
  epoll: allocate user header and user events ring for polling from
    userspace
  epoll: some sanity flags checks for epoll syscalls for polling from
    userspace
  epoll: offload polling to a work in case of epfd polled from userspace
  epoll: introduce helpers for adding/removing events to uring
  epoll: call ep_add_event_to_uring() from ep_poll_callback()
  epoll: support polling from userspace for ep_insert()
  epoll: support polling from userspace for ep_remove()
  epoll: support polling from userspace for ep_modify()
  epoll: support polling from userspace for ep_poll()
  epoll: support mapping for epfd when polled from userspace
  epoll: implement epoll_create2() syscall

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/eventpoll.c                         | 714 ++++++++++++++++++++++---
 include/linux/syscalls.h               |   1 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/eventpoll.h         |  46 +-
 kernel/sys_ni.c                        |   1 +
 7 files changed, 669 insertions(+), 99 deletions(-)

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
-- 
2.21.0



* [PATCH v3 01/13] epoll: move private helpers from a header to the source
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
@ 2019-05-16  8:57 ` Roman Penyaev
  2019-05-16  8:57 ` [PATCH v3 02/13] epoll: introduce user structures for polling from userspace Roman Penyaev
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:57 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

Those helpers will access the private eventpoll structure in future patches,
so keep them close to their callers.

Nothing important here.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4a0e98d87fcc..2cc183e86a29 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -466,6 +466,19 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi)
 
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
+#ifdef CONFIG_PM_SLEEP
+static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+{
+	if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND))
+		epev->events &= ~EPOLLWAKEUP;
+}
+#else
+static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+{
+	epev->events &= ~EPOLLWAKEUP;
+}
+#endif /* CONFIG_PM_SLEEP */
+
 /**
  * ep_call_nested - Perform a bound (possibly) nested call, by checking
  *                  that the recursion limit is not exceeded, and that
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 8a3432d0f0dc..39dfc29f0f52 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -79,16 +79,4 @@ struct epoll_event {
 	__u64 data;
 } EPOLL_PACKED;
 
-#ifdef CONFIG_PM_SLEEP
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
-{
-	if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND))
-		epev->events &= ~EPOLLWAKEUP;
-}
-#else
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
-{
-	epev->events &= ~EPOLLWAKEUP;
-}
-#endif
 #endif /* _UAPI_LINUX_EVENTPOLL_H */
-- 
2.21.0



* [PATCH v3 02/13] epoll: introduce user structures for polling from userspace
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
  2019-05-16  8:57 ` [PATCH v3 01/13] epoll: move private helpers from a header to the source Roman Penyaev
@ 2019-05-16  8:57 ` Roman Penyaev
  2019-05-16  8:58 ` [PATCH v3 03/13] epoll: allocate user header and user events ring " Roman Penyaev
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:57 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

This one introduces structures of user items array:

struct epoll_uheader -
    describes inserted epoll items.

struct epoll_uitem -
    single epoll item visible to userspace.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2cc183e86a29..f598442512f3 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -9,6 +9,8 @@
  *
  *  Davide Libenzi <davidel@xmailserver.org>
  *
+ *  Polling from userspace support by Roman Penyaev <rpenyaev@suse.de>
+ *  (C) Copyright 2019 SUSE, All Rights Reserved
  */
 
 #include <linux/init.h>
@@ -109,6 +111,13 @@
 
 #define EP_ITEM_COST (sizeof(struct epitem) + sizeof(struct eppoll_entry))
 
+/*
+ * That is around 1.3mb of allocated memory for one epfd.  What is more
+ * important is ->index_length, which should be ^2, so do not increase
+ * max items number to avoid size doubling of user index.
+ */
+#define EP_USERPOLL_MAX_ITEMS_NR 65536
+
 struct epoll_filefd {
 	struct file *file;
 	int fd;
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 39dfc29f0f52..161dfcbcf3b5 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -79,4 +79,32 @@ struct epoll_event {
 	__u64 data;
 } EPOLL_PACKED;
 
+#define EPOLL_USERPOLL_HEADER_MAGIC 0xeb01eb01
+#define EPOLL_USERPOLL_HEADER_SIZE  128
+
+/*
+ * Item, shared with userspace.  Unfortunately we can't embed epoll_event
+ * structure, because it is badly aligned on all 64-bit archs, except
+ * x86-64 (see EPOLL_PACKED).  sizeof(epoll_uitem) == 16
+ */
+struct epoll_uitem {
+	__poll_t ready_events;
+	__poll_t events;
+	__u64 data;
+};
+
+/*
+ * Header, shared with userspace. sizeof(epoll_uheader) == 128
+ */
+struct epoll_uheader {
+	u32 magic;          /* epoll user header magic */
+	u32 header_length;  /* length of the header + items */
+	u32 index_length;   /* length of the index ring, always pow2 */
+	u32 max_items_nr;   /* max number of items */
+	u32 head;           /* updated by userland */
+	u32 tail;           /* updated by kernel */
+
+	struct epoll_uitem items[] __aligned(EPOLL_USERPOLL_HEADER_SIZE);
+};
+
 #endif /* _UAPI_LINUX_EVENTPOLL_H */
-- 
2.21.0



* [PATCH v3 03/13] epoll: allocate user header and user events ring for polling from userspace
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
  2019-05-16  8:57 ` [PATCH v3 01/13] epoll: move private helpers from a header to the source Roman Penyaev
  2019-05-16  8:57 ` [PATCH v3 02/13] epoll: introduce user structures for polling from userspace Roman Penyaev
@ 2019-05-16  8:58 ` Roman Penyaev
  2019-05-16  8:58 ` [PATCH v3 04/13] epoll: some sanity flags checks for epoll syscalls " Roman Penyaev
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:58 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

This one allocates the user header and user events ring according to the max
items number, passed as a parameter.  The user events (index) ring size is
rounded up to a power of 2.  Pages, which will be shared between kernel and
userspace, are accounted through the user->locked_vm counter.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index f598442512f3..e3f82ff0b26b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -231,6 +231,27 @@ struct eventpoll {
 
 	struct file *file;
 
+	/* User header with array of items */
+	struct epoll_uheader *user_header;
+
+	/* User index, which acts as a ring of coming events */
+	unsigned int *user_index;
+
+	/* Actual length of user header, always aligned on page */
+	unsigned int header_length;
+
+	/* Actual length of user index, always pow2 */
+	unsigned int index_length;
+
+	/* Maximum possible event items */
+	unsigned int max_items_nr;
+
+	/* Items bitmap, is used to get a free bit for new registered epi */
+	unsigned long *items_bm;
+
+	/* Length of both items bitmaps, always aligned on page */
+	unsigned int items_bm_length;
+
 	/* used to optimize loop detection check */
 	int visited;
 	struct list_head visited_list_link;
@@ -381,6 +402,27 @@ static void ep_nested_calls_init(struct nested_calls *ncalls)
 	spin_lock_init(&ncalls->lock);
 }
 
+static inline unsigned int ep_to_items_length(unsigned int nr)
+{
+	struct epoll_uheader *user_header;
+
+	return PAGE_ALIGN(struct_size(user_header, items, nr));
+}
+
+static inline unsigned int ep_to_index_length(unsigned int nr)
+{
+	struct eventpoll *ep;
+	unsigned int size;
+
+	size = roundup_pow_of_two(nr << ilog2(sizeof(*ep->user_index)));
+	return max_t(typeof(size), size, PAGE_SIZE);
+}
+
+static inline unsigned int ep_to_items_bm_length(unsigned int nr)
+{
+	return PAGE_ALIGN(ALIGN(nr, 8) >> 3);
+}
+
 /**
  * ep_events_available - Checks if ready events might be available.
  *
@@ -836,6 +878,38 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 	return 0;
 }
 
+static int ep_account_mem(struct eventpoll *ep, struct user_struct *user)
+{
+	unsigned long nr_pages, page_limit, cur_pages, new_pages;
+
+	nr_pages  = ep->header_length >> PAGE_SHIFT;
+	nr_pages += ep->index_length >> PAGE_SHIFT;
+
+	/* Don't allow more pages than we can safely lock */
+	page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	do {
+		cur_pages = atomic_long_read(&user->locked_vm);
+		new_pages = cur_pages + nr_pages;
+		if (new_pages > page_limit && !capable(CAP_IPC_LOCK))
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&user->locked_vm, cur_pages,
+				     new_pages) != cur_pages);
+
+	return 0;
+}
+
+static void ep_unaccount_mem(struct eventpoll *ep, struct user_struct *user)
+{
+	unsigned long nr_pages;
+
+	nr_pages  = ep->header_length >> PAGE_SHIFT;
+	nr_pages += ep->index_length >> PAGE_SHIFT;
+	if (nr_pages)
+		/* When polled by user */
+		atomic_long_sub(nr_pages, &user->locked_vm);
+}
+
 static void ep_free(struct eventpoll *ep)
 {
 	struct rb_node *rbp;
@@ -883,8 +957,12 @@ static void ep_free(struct eventpoll *ep)
 
 	mutex_unlock(&epmutex);
 	mutex_destroy(&ep->mtx);
-	free_uid(ep->user);
 	wakeup_source_unregister(ep->ws);
+	vfree(ep->user_header);
+	vfree(ep->user_index);
+	vfree(ep->items_bm);
+	ep_unaccount_mem(ep, ep->user);
+	free_uid(ep->user);
 	kfree(ep);
 }
 
@@ -1037,7 +1115,7 @@ void eventpoll_release_file(struct file *file)
 	mutex_unlock(&epmutex);
 }
 
-static int ep_alloc(struct eventpoll **pep)
+static int ep_alloc(struct eventpoll **pep, int flags, size_t max_items)
 {
 	int error;
 	struct user_struct *user;
@@ -1049,6 +1127,38 @@ static int ep_alloc(struct eventpoll **pep)
 	if (unlikely(!ep))
 		goto free_uid;
 
+	if (flags & EPOLL_USERPOLL) {
+		BUILD_BUG_ON(sizeof(*ep->user_header) !=
+			     EPOLL_USERPOLL_HEADER_SIZE);
+		BUILD_BUG_ON(sizeof(ep->user_header->items[0]) != 16);
+
+		if (!max_items || max_items > EP_USERPOLL_MAX_ITEMS_NR) {
+			error = -EINVAL;
+			goto free_ep;
+		}
+		ep->max_items_nr = max_items;
+		ep->header_length = ep_to_items_length(max_items);
+		ep->index_length = ep_to_index_length(max_items);
+		ep->items_bm_length = ep_to_items_bm_length(max_items);
+
+		error = ep_account_mem(ep, user);
+		if (error)
+			goto free_ep;
+
+		ep->user_header = vmalloc_user(ep->header_length);
+		ep->user_index = vmalloc_user(ep->index_length);
+		ep->items_bm = vzalloc(ep->items_bm_length);
+		if (!ep->user_header || !ep->user_index || !ep->items_bm)
+			goto unaccount_mem;
+
+		*ep->user_header = (typeof(*ep->user_header)) {
+			.magic         = EPOLL_USERPOLL_HEADER_MAGIC,
+			.header_length = ep->header_length,
+			.index_length  = ep->index_length,
+			.max_items_nr  = ep->max_items_nr,
+		};
+	}
+
 	mutex_init(&ep->mtx);
 	rwlock_init(&ep->lock);
 	init_waitqueue_head(&ep->wq);
@@ -1062,6 +1172,13 @@ static int ep_alloc(struct eventpoll **pep)
 
 	return 0;
 
+unaccount_mem:
+	ep_unaccount_mem(ep, user);
+free_ep:
+	vfree(ep->user_header);
+	vfree(ep->user_index);
+	vfree(ep->items_bm);
+	kfree(ep);
 free_uid:
 	free_uid(user);
 	return error;
@@ -2066,7 +2183,7 @@ static void clear_tfile_check_list(void)
 /*
  * Open an eventpoll file descriptor.
  */
-static int do_epoll_create(int flags)
+static int do_epoll_create(int flags, size_t size)
 {
 	int error, fd;
 	struct eventpoll *ep = NULL;
@@ -2075,12 +2192,12 @@ static int do_epoll_create(int flags)
 	/* Check the EPOLL_* constant for consistency.  */
 	BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
 
-	if (flags & ~EPOLL_CLOEXEC)
+	if (flags & ~(EPOLL_CLOEXEC | EPOLL_USERPOLL))
 		return -EINVAL;
 	/*
 	 * Create the internal data structure ("struct eventpoll").
 	 */
-	error = ep_alloc(&ep);
+	error = ep_alloc(&ep, flags, size);
 	if (error < 0)
 		return error;
 	/*
@@ -2111,7 +2228,7 @@ static int do_epoll_create(int flags)
 
 SYSCALL_DEFINE1(epoll_create1, int, flags)
 {
-	return do_epoll_create(flags);
+	return do_epoll_create(flags, 0);
 }
 
 SYSCALL_DEFINE1(epoll_create, int, size)
@@ -2119,7 +2236,7 @@ SYSCALL_DEFINE1(epoll_create, int, size)
 	if (size <= 0)
 		return -EINVAL;
 
-	return do_epoll_create(0);
+	return do_epoll_create(0, 0);
 }
 
 /*
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 161dfcbcf3b5..95ac0b564529 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -20,7 +20,8 @@
 #include <linux/types.h>
 
 /* Flags for epoll_create1.  */
-#define EPOLL_CLOEXEC O_CLOEXEC
+#define EPOLL_CLOEXEC  O_CLOEXEC
+#define EPOLL_USERPOLL 1
 
 /* Valid opcodes to issue to sys_epoll_ctl() */
 #define EPOLL_CTL_ADD 1
-- 
2.21.0



* [PATCH v3 04/13] epoll: some sanity flags checks for epoll syscalls for polling from userspace
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (2 preceding siblings ...)
  2019-05-16  8:58 ` [PATCH v3 03/13] epoll: allocate user header and user events ring " Roman Penyaev
@ 2019-05-16  8:58 ` Roman Penyaev
  2019-05-16  8:58 ` [PATCH v3 05/13] epoll: offload polling to a work in case of epfd polled " Roman Penyaev
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:58 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

There are various limitations if epfd is polled by userspace:

 1. Always expect the EPOLLET flag (Edge Triggered behavior)

 2. No support for EPOLLWAKEUP
       events are consumed from userspace, thus no way to call __pm_relax()

 3. No support for EPOLLEXCLUSIVE
       If device does not pass pollflags to wake_up() there is no way to
       call poll() from the context under spinlock, thus special work is
       scheduled to offload polling.  In this specific case we can't
       support exclusive wakeups, because we do not know actual result
       of scheduled work.

4. epoll_wait() for epfd, created with EPOLL_USERPOLL flag, accepts events
   as NULL and maxevents as 0.  No other values are accepted.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index e3f82ff0b26b..81da4571f1e0 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -423,6 +423,11 @@ static inline unsigned int ep_to_items_bm_length(unsigned int nr)
 	return PAGE_ALIGN(ALIGN(nr, 8) >> 3);
 }
 
+static inline bool ep_polled_by_user(struct eventpoll *ep)
+{
+	return !!ep->user_header;
+}
+
 /**
  * ep_events_available - Checks if ready events might be available.
  *
@@ -518,13 +523,17 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi)
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
 #ifdef CONFIG_PM_SLEEP
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+static inline void ep_take_care_of_epollwakeup(struct eventpoll *ep,
+					       struct epoll_event *epev)
 {
-	if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND))
-		epev->events &= ~EPOLLWAKEUP;
+	if (epev->events & EPOLLWAKEUP) {
+		if (!capable(CAP_BLOCK_SUSPEND) || ep_polled_by_user(ep))
+			epev->events &= ~EPOLLWAKEUP;
+	}
 }
 #else
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+static inline void ep_take_care_of_epollwakeup(struct eventpoll *ep,
+					       struct epoll_event *epev)
 {
 	epev->events &= ~EPOLLWAKEUP;
 }
@@ -2275,10 +2284,6 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (!file_can_poll(tf.file))
 		goto error_tgt_fput;
 
-	/* Check if EPOLLWAKEUP is allowed */
-	if (ep_op_has_event(op))
-		ep_take_care_of_epollwakeup(&epds);
-
 	/*
 	 * We have to check that the file structure underneath the file descriptor
 	 * the user passed to us _is_ an eventpoll file. And also we do not permit
@@ -2288,10 +2293,18 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (f.file == tf.file || !is_file_epoll(f.file))
 		goto error_tgt_fput;
 
+	/*
+	 * At this point it is safe to assume that the "private_data" contains
+	 * our own data structure.
+	 */
+	ep = f.file->private_data;
+
 	/*
 	 * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
 	 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
-	 * Also, we do not currently supported nested exclusive wakeups.
+	 * Also, we do not currently supported nested exclusive wakeups
+	 * and EPOLLEXCLUSIVE is not supported for epoll which is polled
+	 * from userspace.
 	 */
 	if (ep_op_has_event(op) && (epds.events & EPOLLEXCLUSIVE)) {
 		if (op == EPOLL_CTL_MOD)
@@ -2299,13 +2312,18 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
 				(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
 			goto error_tgt_fput;
+		if (ep_polled_by_user(ep))
+			goto error_tgt_fput;
 	}
 
-	/*
-	 * At this point it is safe to assume that the "private_data" contains
-	 * our own data structure.
-	 */
-	ep = f.file->private_data;
+	if (ep_op_has_event(op)) {
+		if (ep_polled_by_user(ep) && !(epds.events & EPOLLET))
+			/* Polled by user has only edge triggered behaviour */
+			goto error_tgt_fput;
+
+		/* Check if EPOLLWAKEUP is allowed */
+		ep_take_care_of_epollwakeup(ep, &epds);
+	}
 
 	/*
 	 * When we insert an epoll file descriptor, inside another epoll file
@@ -2407,14 +2425,6 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events,
 	struct fd f;
 	struct eventpoll *ep;
 
-	/* The maximum number of event must be greater than zero */
-	if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
-		return -EINVAL;
-
-	/* Verify that the area passed by the user is writeable */
-	if (!access_ok(events, maxevents * sizeof(struct epoll_event)))
-		return -EFAULT;
-
 	/* Get the "struct file *" for the eventpoll file */
 	f = fdget(epfd);
 	if (!f.file)
@@ -2433,6 +2443,20 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events,
 	 * our own data structure.
 	 */
 	ep = f.file->private_data;
+	if (!ep_polled_by_user(ep)) {
+		/* The maximum number of event must be greater than zero */
+		if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
+			goto error_fput;
+
+		/* Verify that the area passed by the user is writeable */
+		error = -EFAULT;
+		if (!access_ok(events, maxevents * sizeof(struct epoll_event)))
+			goto error_fput;
+	} else {
+		/* Use ring instead */
+		if (maxevents != 0 || events != NULL)
+			goto error_fput;
+	}
 
 	/* Time to fish for events ... */
 	error = ep_poll(ep, events, maxevents, timeout);
-- 
2.21.0



* [PATCH v3 05/13] epoll: offload polling to a work in case of epfd polled from userspace
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (3 preceding siblings ...)
  2019-05-16  8:58 ` [PATCH v3 04/13] epoll: some sanity flags checks for epoll syscalls " Roman Penyaev
@ 2019-05-16  8:58 ` Roman Penyaev
  2019-05-21  7:51   ` Eric Wong
  2019-05-16  8:58 ` [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring Roman Penyaev
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:58 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

Not every device reports pollflags on wake_up(), expecting that it will be
polled later.  vfs_poll() can't be called from ep_poll_callback(), because
ep_poll_callback() is called under the spinlock.  Obviously userspace can't
call vfs_poll() either, thus epoll has to offload vfs_poll() to a work and
then call ep_poll_callback() with the pollflags in hand.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 81da4571f1e0..9d3905c0afbf 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -44,6 +44,7 @@
 #include <linux/seq_file.h>
 #include <linux/compat.h>
 #include <linux/rculist.h>
+#include <linux/workqueue.h>
 #include <net/busy_poll.h>
 
 /*
@@ -185,6 +186,9 @@ struct epitem {
 
 	/* The structure that describe the interested events and the source fd */
 	struct epoll_event event;
+
+	/* Work for offloading event callback */
+	struct work_struct work;
 };
 
 /*
@@ -696,6 +700,14 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
 		ep_remove_wait_queue(pwq);
 		kmem_cache_free(pwq_cache, pwq);
 	}
+	if (ep_polled_by_user(ep)) {
+		/*
+		 * Events polled by user require offloading to a work,
+		 * thus we have to be sure everything which was queued
+		 * has run to a completion.
+		 */
+		flush_work(&epi->work);
+	}
 }
 
 /* call only when ep->mtx is held */
@@ -1340,9 +1352,8 @@ static inline bool chain_epi_lockless(struct epitem *epi)
 }
 
 /*
- * This is the callback that is passed to the wait queue wakeup
- * mechanism. It is called by the stored file descriptors when they
- * have events to report.
+ * This is the callback that is called directly from wake queue wakeup or
+ * from a work.
  *
  * This callback takes a read lock in order not to content with concurrent
  * events from another file descriptors, thus all modifications to ->rdllist
@@ -1357,14 +1368,11 @@ static inline bool chain_epi_lockless(struct epitem *epi)
  * queues are used should be detected accordingly.  This is detected using
  * cmpxchg() operation.
  */
-static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
+static int ep_poll_callback(struct epitem *epi, __poll_t pollflags)
 {
-	int pwake = 0;
-	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
-	__poll_t pollflags = key_to_poll(key);
+	int pwake = 0, ewake = 0;
 	unsigned long flags;
-	int ewake = 0;
 
 	read_lock_irqsave(&ep->lock, flags);
 
@@ -1382,8 +1390,9 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	/*
 	 * Check the events coming with the callback. At this stage, not
 	 * every device reports the events in the "key" parameter of the
-	 * callback. We need to be able to handle both cases here, hence the
-	 * test for "key" != NULL before the event match test.
+	 * callback (for ep_poll_callback() case special worker is used).
+	 * We need to be able to handle both cases here, hence the test
+	 * for "key" != NULL before the event match test.
 	 */
 	if (pollflags && !(pollflags & epi->event.events))
 		goto out_unlock;
@@ -1443,23 +1452,67 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	if (!(epi->event.events & EPOLLEXCLUSIVE))
 		ewake = 1;
 
-	if (pollflags & POLLFREE) {
-		/*
-		 * If we race with ep_remove_wait_queue() it can miss
-		 * ->whead = NULL and do another remove_wait_queue() after
-		 * us, so we can't use __remove_wait_queue().
-		 */
-		list_del_init(&wait->entry);
+	return ewake;
+}
+
+static void ep_poll_callback_work(struct work_struct *work)
+{
+	struct epitem *epi = container_of(work, typeof(*epi), work);
+	__poll_t pollflags;
+	poll_table pt;
+
+	WARN_ON(!ep_polled_by_user(epi->ep));
+
+	init_poll_funcptr(&pt, NULL);
+	pollflags = ep_item_poll(epi, &pt, 1);
+	if (pollflags)
+		(void)ep_poll_callback(epi, pollflags);
+}
+
+/*
+ * This is the callback that is passed to the wait queue wakeup
+ * mechanism. It is called by the stored file descriptors when they
+ * have events to report.
+ */
+static int ep_poll_wakeup(wait_queue_entry_t *wait, unsigned int mode,
+			  int sync, void *key)
+{
+
+	struct epitem *epi = ep_item_from_wait(wait);
+	struct eventpoll *ep = epi->ep;
+	__poll_t pollflags = key_to_poll(key);
+	int rc;
+
+	if (!ep_polled_by_user(ep) || pollflags) {
+		rc = ep_poll_callback(epi, pollflags);
+
+		if (pollflags & POLLFREE) {
+			/*
+			 * If we race with ep_remove_wait_queue() it can miss
+			 * ->whead = NULL and do another remove_wait_queue()
+			 * after us, so we can't use __remove_wait_queue().
+			 */
+			list_del_init(&wait->entry);
+			/*
+			 * ->whead != NULL protects us from the race with
+			 * ep_free() or ep_remove(), ep_remove_wait_queue()
+			 * takes whead->lock held by the caller. Once we nullify
+			 * it, nothing protects ep/epi or even wait.
+			 */
+			smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
+		}
+	} else {
+		schedule_work(&epi->work);
+
 		/*
-		 * ->whead != NULL protects us from the race with ep_free()
-		 * or ep_remove(), ep_remove_wait_queue() takes whead->lock
-		 * held by the caller. Once we nullify it, nothing protects
-		 * ep/epi or even wait.
+		 * Here on this path we are absolutely sure that for file
+		 * descriptors* which are pollable from userspace we do not
+		 * support EPOLLEXCLUSIVE, so it is safe to return 1.
 		 */
-		smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
+		rc = 1;
 	}
 
-	return ewake;
+	return rc;
 }
 
 /*
@@ -1473,7 +1526,7 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 	struct eppoll_entry *pwq;
 
 	if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
-		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
+		init_waitqueue_func_entry(&pwq->wait, ep_poll_wakeup);
 		pwq->whead = whead;
 		pwq->base = epi;
 		if (epi->event.events & EPOLLEXCLUSIVE)
@@ -1667,6 +1720,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	INIT_LIST_HEAD(&epi->rdllink);
 	INIT_LIST_HEAD(&epi->fllink);
 	INIT_LIST_HEAD(&epi->pwqlist);
+	INIT_WORK(&epi->work, ep_poll_callback_work);
 	epi->ep = ep;
 	ep_set_ffd(&epi->ffd, tfile, fd);
 	epi->event = *event;
@@ -2547,12 +2601,6 @@ static int __init eventpoll_init(void)
 	ep_nested_calls_init(&poll_safewake_ncalls);
 #endif
 
-	/*
-	 * We can have many thousands of epitems, so prevent this from
-	 * using an extra cache line on 64-bit (and smaller) CPUs
-	 */
-	BUILD_BUG_ON(sizeof(void *) <= 8 && sizeof(struct epitem) > 128);
-
 	/* Allocates slab cache used to allocate "struct epitem" items */
 	epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
 			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
-- 
2.21.0



* [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (4 preceding siblings ...)
  2019-05-16  8:58 ` [PATCH v3 05/13] epoll: offload polling to a work in case of epfd polled " Roman Penyaev
@ 2019-05-16  8:58 ` Roman Penyaev
  2019-05-31  9:55   ` Peter Zijlstra
  2019-05-31  9:56   ` Peter Zijlstra
  2019-05-16  8:58 ` [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback() Roman Penyaev
                   ` (9 subsequent siblings)
  15 siblings, 2 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:58 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

Both add and remove events are lockless and can be called in parallel.

ep_add_event_to_uring():
	o user item is marked atomically as ready
	o if on the previous step the user item was observed as not ready,
	  then a new entry is created for the index uring.

ep_remove_user_item():
	o user item is marked as EPOLLREMOVED only if it was ready,
	  thus userspace will observe the previously added entry in the
	  index uring and the correct "removed" state of the item.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 9d3905c0afbf..2f551c005640 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -189,6 +189,9 @@ struct epitem {
 
 	/* Work for offloading event callback */
 	struct work_struct work;
+
+	/* Bit in user bitmap for user polling */
+	unsigned int bit;
 };
 
 /*
@@ -427,6 +430,11 @@ static inline unsigned int ep_to_items_bm_length(unsigned int nr)
 	return PAGE_ALIGN(ALIGN(nr, 8) >> 3);
 }
 
+static inline unsigned int ep_max_index_nr(struct eventpoll *ep)
+{
+	return ep->index_length >> ilog2(sizeof(*ep->user_index));
+}
+
 static inline bool ep_polled_by_user(struct eventpoll *ep)
 {
 	return !!ep->user_header;
@@ -857,6 +865,127 @@ static void epi_rcu_free(struct rcu_head *head)
 	kmem_cache_free(epi_cache, epi);
 }
 
+#define atomic_set_unless_zero(ptr, flags)			\
+({								\
+	typeof(ptr) _ptr = (ptr);				\
+	typeof(flags) _flags = (flags);				\
+	typeof(*_ptr) _old, _val = READ_ONCE(*_ptr);		\
+								\
+	for (;;) {						\
+		if (!_val)					\
+			break;					\
+		_old = cmpxchg(_ptr, _val, _flags);		\
+		if (_old == _val)				\
+			break;					\
+		_val = _old;					\
+	}							\
+	_val;							\
+})
+
+static inline void ep_remove_user_item(struct epitem *epi)
+{
+	struct eventpoll *ep = epi->ep;
+	struct epoll_uitem *uitem;
+
+	lockdep_assert_held(&ep->mtx);
+
+	/* Event should not have any attached queues */
+	WARN_ON(!list_empty(&epi->pwqlist));
+
+	uitem = &ep->user_header->items[epi->bit];
+
+	/*
+	 * User item can be in two states: signaled (read_events is set
+	 * and userspace has not yet consumed this event) and not signaled
+	 * (no events yet fired or already consumed by userspace).
+	 * We reset ready_events to EPOLLREMOVED only if ready_events is
+	 * in signaled state (we expect that userspace will come soon and
+	 * fetch this event).  In case of not signaled leave read_events
+	 * as 0.
+	 *
+	 * Why it is important to mark read_events as EPOLLREMOVED in case
+	 * of already signaled state?  ep_insert() op can be immediately
+	 * called after ep_remove(), thus the same bit can be reused and
+	 * then new event comes, which corresponds to the same entry inside
+	 * user items array.  For this particular case ep_add_event_to_uring()
+	 * does not allocate a new index entry, but simply masks EPOLLREMOVED,
+	 * and userspace uses old index entry, but meanwhile old user item
+	 * has been removed, new item has been added and event updated.
+	 */
+	atomic_set_unless_zero(&uitem->ready_events, EPOLLREMOVED);
+	clear_bit(epi->bit, ep->items_bm);
+}
+
+#define atomic_or_with_mask(ptr, flags, mask)			\
+({								\
+	typeof(ptr) _ptr = (ptr);				\
+	typeof(flags) _flags = (flags);				\
+	typeof(flags) _mask = (mask);				\
+	typeof(*_ptr) _old, _new, _val = READ_ONCE(*_ptr);	\
+								\
+	for (;;) {						\
+		_new = (_val & ~_mask) | _flags;		\
+		_old = cmpxchg(_ptr, _val, _new);		\
+		if (_old == _val)				\
+			break;					\
+		_val = _old;					\
+	}							\
+	_val;							\
+})
+
+static inline bool ep_add_event_to_uring(struct epitem *epi, __poll_t pollflags)
+{
+	struct eventpoll *ep = epi->ep;
+	struct epoll_uitem *uitem;
+	bool added = false;
+
+	if (WARN_ON(!pollflags))
+		return false;
+
+	uitem = &ep->user_header->items[epi->bit];
+	/*
+	 * Can be represented as:
+	 *
+	 *    was_ready = uitem->ready_events;
+	 *    uitem->ready_events &= ~EPOLLREMOVED;
+	 *    uitem->ready_events |= pollflags;
+	 *    if (!was_ready) {
+	 *         // create index entry
+	 *    }
+	 *
+	 * See the big comment inside ep_remove_user_item(), why it is
+	 * important to mask EPOLLREMOVED.
+	 */
+	if (!atomic_or_with_mask(&uitem->ready_events,
+				 pollflags, EPOLLREMOVED)) {
+		unsigned int i, *item_idx, index_mask;
+
+		/*
+		 * Item was not ready before, thus we have to insert
+		 * new index to the ring.
+		 */
+
+		index_mask = ep_max_index_nr(ep) - 1;
+		i = __atomic_fetch_add(&ep->user_header->tail, 1,
+				       __ATOMIC_ACQUIRE);
+		item_idx = &ep->user_index[i & index_mask];
+
+		/* Signal with a bit, which is > 0 */
+		*item_idx = epi->bit + 1;
+
+		/*
+		 * Want index update be flushed from CPU write buffer and
+		 * immediately visible on userspace side to avoid long busy
+		 * loops.
+		 */
+		smp_wmb();
+
+		added = true;
+	}
+
+	return added;
+}
+
 /*
  * Removes a "struct epitem" from the eventpoll RB tree and deallocates
  * all the associated resources. Must be called with "mtx" held.
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 95ac0b564529..70aa80ddff25 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -42,6 +42,9 @@
 #define EPOLLMSG	(__force __poll_t)0x00000400
 #define EPOLLRDHUP	(__force __poll_t)0x00002000
 
+/* User item marked as removed for EPOLL_USERPOLL */
+#define EPOLLREMOVED	((__force __poll_t)(1U << 27))
+
 /* Set exclusive wakeup mode for the target file descriptor */
 #define EPOLLEXCLUSIVE	((__force __poll_t)(1U << 28))
 
-- 
2.21.0



* [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback()
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (5 preceding siblings ...)
  2019-05-16  8:58 ` [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring Roman Penyaev
@ 2019-05-16  8:58 ` Roman Penyaev
  2019-05-31  9:56   ` Peter Zijlstra
  2019-05-16  8:58 ` [PATCH v3 08/13] epoll: support polling from userspace for ep_insert() Roman Penyaev
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:58 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

Each ep_poll_callback() is called when the fd calls wakeup() on epfd.
So account the new event in the user ring.

The tricky part here is EPOLLONESHOT.  Since we are lockless we have
to deal with ep_poll_callbacks() called in parallel, thus use cmpxchg
to clear public event bits and filter out a concurrent call from
another cpu.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2f551c005640..55612da9651e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1407,6 +1407,29 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
 }
 #endif /* CONFIG_CHECKPOINT_RESTORE */
 
+/**
+ * Atomically clear public event bits and return %true if the old value has
+ * public event bits set.
+ */
+static inline bool ep_clear_public_event_bits(struct epitem *epi)
+{
+	__poll_t old, flags;
+
+	/*
+	 * Here we race with ourselves and with ep_modify(), which can
+	 * change the event bits.  In order not to override events updated
+	 * by ep_modify() we have to do cmpxchg.
+	 */
+
+	old = epi->event.events;
+	do {
+		flags = old;
+	} while ((old = cmpxchg(&epi->event.events, flags,
+				flags & EP_PRIVATE_BITS)) != flags);
+
+	return flags & ~EP_PRIVATE_BITS;
+}
+
 /**
  * Adds a new entry to the tail of the list in a lockless way, i.e.
  * multiple CPUs are allowed to call this function concurrently.
@@ -1526,6 +1549,20 @@ static int ep_poll_callback(struct epitem *epi, __poll_t pollflags)
 	if (pollflags && !(pollflags & epi->event.events))
 		goto out_unlock;
 
+	if (ep_polled_by_user(ep)) {
+		/*
+		 * For polled descriptor from user we have to disable events on
+		 * callback path in case of one-shot.
+		 */
+		if ((epi->event.events & EPOLLONESHOT) &&
+		    !ep_clear_public_event_bits(epi))
+			/* Race is lost, another callback has cleared events */
+			goto out_unlock;
+
+		ep_add_event_to_uring(epi, pollflags);
+		goto wakeup;
+	}
+
 	/*
 	 * If we are transferring events to userspace, we can hold no locks
 	 * (because we're accessing user memory, and because of linux f_op->poll()
@@ -1545,6 +1582,7 @@ static int ep_poll_callback(struct epitem *epi, __poll_t pollflags)
 		ep_pm_stay_awake_rcu(epi);
 	}
 
+wakeup:
 	/*
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
-- 
2.21.0



* [PATCH v3 08/13] epoll: support polling from userspace for ep_insert()
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (6 preceding siblings ...)
  2019-05-16  8:58 ` [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback() Roman Penyaev
@ 2019-05-16  8:58 ` Roman Penyaev
  2019-05-16  8:58 ` [PATCH v3 09/13] epoll: support polling from userspace for ep_remove() Roman Penyaev
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:58 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

When epfd is polled by userspace and a new item is inserted, a new bit
should be taken from the bitmap and then the user item is set up accordingly.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 55612da9651e..f1ffccfcca67 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -865,6 +865,23 @@ static void epi_rcu_free(struct rcu_head *head)
 	kmem_cache_free(epi_cache, epi);
 }
 
+static inline int ep_get_bit(struct eventpoll *ep)
+{
+	bool was_set;
+	int bit;
+
+	lockdep_assert_held(&ep->mtx);
+
+	bit = find_first_zero_bit(ep->items_bm, ep->max_items_nr);
+	if (bit >= ep->max_items_nr)
+		return -ENOSPC;
+
+	was_set = test_and_set_bit(bit, ep->items_bm);
+	WARN_ON(was_set);
+
+	return bit;
+}
+
 #define atomic_set_unless_zero(ptr, flags)			\
 ({								\
 	typeof(ptr) _ptr = (ptr);				\
@@ -1875,6 +1892,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	struct epitem *epi;
 	struct ep_pqueue epq;
 
+	lockdep_assert_held(&ep->mtx);
 	lockdep_assert_irqs_enabled();
 
 	user_watches = atomic_long_read(&ep->user->epoll_watches);
@@ -1901,6 +1919,29 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 		RCU_INIT_POINTER(epi->ws, NULL);
 	}
 
+	if (ep_polled_by_user(ep)) {
+		struct epoll_uitem *uitem;
+		int bit;
+
+		bit = ep_get_bit(ep);
+		if (unlikely(bit < 0)) {
+			error = bit;
+			goto error_get_bit;
+		}
+		epi->bit = bit;
+
+		/*
+		 * Now fill-in user item.  Do not touch ready_events, since
+		 * it can be EPOLLREMOVED (has been set by previous user
+		 * item), thus user index entry can be not yet consumed
+		 * by userspace.  See ep_remove_user_item() and
+		 * ep_add_event_to_uring() for details.
+		 */
+		uitem = &ep->user_header->items[epi->bit];
+		uitem->events = event->events;
+		uitem->data = event->data;
+	}
+
 	/* Initialize the poll table using the queue callback */
 	epq.epi = epi;
 	init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
@@ -1945,16 +1986,23 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	/* record NAPI ID of new item if present */
 	ep_set_busy_poll_napi_id(epi);
 
-	/* If the file is already "ready" we drop it inside the ready list */
-	if (revents && !ep_is_linked(epi)) {
-		list_add_tail(&epi->rdllink, &ep->rdllist);
-		ep_pm_stay_awake(epi);
+	if (revents) {
+		bool added = false;
 
-		/* Notify waiting tasks that events are available */
-		if (waitqueue_active(&ep->wq))
-			wake_up(&ep->wq);
-		if (waitqueue_active(&ep->poll_wait))
-			pwake++;
+		if (ep_polled_by_user(ep)) {
+			added = ep_add_event_to_uring(epi, revents);
+		} else if (!ep_is_linked(epi)) {
+			list_add_tail(&epi->rdllink, &ep->rdllist);
+			ep_pm_stay_awake(epi);
+			added = true;
+		}
+		if (added) {
+			/* Notify waiting tasks that events are available */
+			if (waitqueue_active(&ep->wq))
+				wake_up(&ep->wq);
+			if (waitqueue_active(&ep->poll_wait))
+				pwake++;
+		}
 	}
 
 	write_unlock_irq(&ep->lock);
@@ -1983,11 +2031,16 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	 * list, since that is used/cleaned only inside a section bound by "mtx".
 	 * And ep_insert() is called with "mtx" held.
 	 */
-	write_lock_irq(&ep->lock);
-	if (ep_is_linked(epi))
-		list_del_init(&epi->rdllink);
-	write_unlock_irq(&ep->lock);
+	if (ep_polled_by_user(ep)) {
+		ep_remove_user_item(epi);
+	} else {
+		write_lock_irq(&ep->lock);
+		if (ep_is_linked(epi))
+			list_del_init(&epi->rdllink);
+		write_unlock_irq(&ep->lock);
+	}
 
+error_get_bit:
 	wakeup_source_unregister(ep_wakeup_source(epi));
 
 error_create_wakeup_source:
-- 
2.21.0



* [PATCH v3 09/13] epoll: support polling from userspace for ep_remove()
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (7 preceding siblings ...)
  2019-05-16  8:58 ` [PATCH v3 08/13] epoll: support polling from userspace for ep_insert() Roman Penyaev
@ 2019-05-16  8:58 ` Roman Penyaev
  2019-05-16  8:58 ` [PATCH v3 10/13] epoll: support polling from userspace for ep_modify() Roman Penyaev
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:58 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

On ep_remove() simply mark a user item with EPOLLREMOVED if the item was
ready (i.e. has some bits set).  That will prevent further user index
entry creation on item ->bit reuse.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index f1ffccfcca67..630f473973da 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1025,10 +1025,14 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 
 	rb_erase_cached(&epi->rbn, &ep->rbr);
 
-	write_lock_irq(&ep->lock);
-	if (ep_is_linked(epi))
-		list_del_init(&epi->rdllink);
-	write_unlock_irq(&ep->lock);
+	if (ep_polled_by_user(ep)) {
+		ep_remove_user_item(epi);
+	} else {
+		write_lock_irq(&ep->lock);
+		if (ep_is_linked(epi))
+			list_del_init(&epi->rdllink);
+		write_unlock_irq(&ep->lock);
+	}
 
 	wakeup_source_unregister(ep_wakeup_source(epi));
 	/*
-- 
2.21.0



* [PATCH v3 10/13] epoll: support polling from userspace for ep_modify()
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (8 preceding siblings ...)
  2019-05-16  8:58 ` [PATCH v3 09/13] epoll: support polling from userspace for ep_remove() Roman Penyaev
@ 2019-05-16  8:58 ` Roman Penyaev
  2019-05-16  8:58 ` [PATCH v3 11/13] epoll: support polling from userspace for ep_poll() Roman Penyaev
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:58 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

When epfd is polled from userspace and an item is being modified:

1. Update the user item with the new data or poll flags.
2. Add an event to the user ring if needed.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 630f473973da..1b7097e58fb2 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2060,6 +2060,8 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 		     const struct epoll_event *event)
 {
+	struct epoll_uitem *uitem;
+	__poll_t revents;
 	int pwake = 0;
 	poll_table pt;
 
@@ -2074,6 +2076,14 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 	 */
 	epi->event.events = event->events; /* need barrier below */
 	epi->event.data = event->data; /* protected by mtx */
+
+	/* Update user item, barrier is below */
+	if (ep_polled_by_user(ep)) {
+		uitem = &ep->user_header->items[epi->bit];
+		uitem->events = event->events;
+		uitem->data = event->data;
+	}
+
 	if (epi->event.events & EPOLLWAKEUP) {
 		if (!ep_has_wakeup_source(epi))
 			ep_create_wakeup_source(epi);
@@ -2107,12 +2117,19 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 	 * If the item is "hot" and it is not registered inside the ready
 	 * list, push it inside.
 	 */
-	if (ep_item_poll(epi, &pt, 1)) {
+	revents = ep_item_poll(epi, &pt, 1);
+	if (revents) {
+		bool added = false;
+
 		write_lock_irq(&ep->lock);
-		if (!ep_is_linked(epi)) {
+		if (ep_polled_by_user(ep))
+			added = ep_add_event_to_uring(epi, revents);
+		else if (!ep_is_linked(epi)) {
 			list_add_tail(&epi->rdllink, &ep->rdllist);
 			ep_pm_stay_awake(epi);
-
+			added = true;
+		}
+		if (added) {
 			/* Notify waiting tasks that events are available */
 			if (waitqueue_active(&ep->wq))
 				wake_up(&ep->wq);
-- 
2.21.0



* [PATCH v3 11/13] epoll: support polling from userspace for ep_poll()
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (9 preceding siblings ...)
  2019-05-16  8:58 ` [PATCH v3 10/13] epoll: support polling from userspace for ep_modify() Roman Penyaev
@ 2019-05-16  8:58 ` Roman Penyaev
  2019-05-16  8:58 ` [PATCH v3 12/13] epoll: support mapping for epfd when polled from userspace Roman Penyaev
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:58 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

The rule of thumb for an epfd polled from userspace is simple: the epfd has
events if ->head != ->tail; no traversal of each item is performed.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1b7097e58fb2..20c94587488f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -440,6 +440,12 @@ static inline bool ep_polled_by_user(struct eventpoll *ep)
 	return !!ep->user_header;
 }
 
+static inline bool ep_uring_events_available(struct eventpoll *ep)
+{
+	return ep_polled_by_user(ep) &&
+		ep->user_header->head != ep->user_header->tail;
+}
+
 /**
  * ep_events_available - Checks if ready events might be available.
  *
@@ -451,7 +457,8 @@ static inline bool ep_polled_by_user(struct eventpoll *ep)
 static inline int ep_events_available(struct eventpoll *ep)
 {
 	return !list_empty_careful(&ep->rdllist) ||
-		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
+		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR ||
+		ep_uring_events_available(ep);
 }
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
@@ -1160,7 +1167,7 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
 				 int depth)
 {
-	struct eventpoll *ep;
+	struct eventpoll *ep, *tep;
 	bool locked;
 
 	pt->_key = epi->event.events;
@@ -1169,6 +1176,26 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
 
 	ep = epi->ffd.file->private_data;
 	poll_wait(epi->ffd.file, &ep->poll_wait, pt);
+
+	tep = epi->ffd.file->private_data;
+	if (ep_polled_by_user(tep)) {
+		/*
+		 * The behaviour differs comparing to full scan of ready
+		 * list for original epoll.  If descriptor is pollable
+		 * from userspace we don't do scan of all ready user items:
+		 * firstly because we can't do reverse search of epi by
+		 * uitem bit, secondly this is simply waste of time for
+		 * edge triggered descriptors (user code should be prepared
+		 * to deal with EAGAIN returned from read() or write() on
+		 * inserted file descriptor) and thirdly once event is put
+		 * into user index ring do not touch it from kernel, what
+		 * we do is mark it as EPOLLREMOVED on ep_remove() and
+		 * that's it.
+		 */
+		return ep_uring_events_available(tep) ?
+			EPOLLIN | EPOLLRDNORM : 0;
+	}
+
 	locked = pt && (pt->_qproc == ep_ptable_queue_proc);
 
 	return ep_scan_ready_list(epi->ffd.file->private_data,
@@ -1211,6 +1238,12 @@ static __poll_t ep_eventpoll_poll(struct file *file, poll_table *wait)
 	/* Insert inside our poll wait queue */
 	poll_wait(file, &ep->poll_wait, wait);
 
+	if (ep_polled_by_user(ep)) {
+		/* Please read detailed comments inside ep_item_poll() */
+		return ep_uring_events_available(ep) ?
+			EPOLLIN | EPOLLRDNORM : 0;
+	}
+
 	/*
 	 * Proceed to find out if wanted events are really available inside
 	 * the ready list.
@@ -2235,6 +2268,8 @@ static int ep_send_events(struct eventpoll *ep,
 {
 	struct ep_send_events_data esed;
 
+	WARN_ON(ep_polled_by_user(ep));
+
 	esed.maxevents = maxevents;
 	esed.events = events;
 
@@ -2281,6 +2316,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 
 	lockdep_assert_irqs_enabled();
 
+	if (ep_polled_by_user(ep)) {
+		if (ep_uring_events_available(ep))
+			/* Firstly all events from ring have to be consumed */
+			return -ESTALE;
+	}
+
 	if (timeout > 0) {
 		struct timespec64 end_time = ep_set_mstimeout(timeout);
 
@@ -2369,14 +2410,21 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	__set_current_state(TASK_RUNNING);
 
 send_events:
-	/*
-	 * Try to transfer events to user space. In case we get 0 events and
-	 * there's still timeout left over, we go trying again in search of
-	 * more luck.
-	 */
-	if (!res && eavail &&
-	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
-		goto fetch_events;
+	if (!res && eavail) {
+		if (!ep_polled_by_user(ep)) {
+			/*
+			 * Try to transfer events to user space. In case we get
+			 * 0 events and there's still timeout left over, we go
+			 * trying again in search of more luck.
+			 */
+			res = ep_send_events(ep, events, maxevents);
+			if (!res && !timed_out)
+				goto fetch_events;
+		} else {
+			/* User has to deal with the ring himself */
+			res = -ESTALE;
+		}
+	}
 
 	if (waiter) {
 		spin_lock_irq(&ep->wq.lock);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 12/13] epoll: support mapping for epfd when polled from userspace
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (10 preceding siblings ...)
  2019-05-16  8:58 ` [PATCH v3 11/13] epoll: support polling from userspace for ep_poll() Roman Penyaev
@ 2019-05-16  8:58 ` Roman Penyaev
  2019-05-16  8:58 ` [PATCH v3 13/13] epoll: implement epoll_create2() syscall Roman Penyaev
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:58 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

The user has to mmap the vmalloc'ed user_header and user_index areas in
order to consume events from userspace.  The vma is also not copied on
fork() (VM_DONTCOPY).
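
For illustration, the userspace side of the mapping could look roughly
like this (untested sketch, declarations and error handling omitted;
header_length and index_length are assumed to be known to the caller):

/* offset 0 maps the user header */
header = mmap(NULL, header_length, PROT_READ | PROT_WRITE,
	      MAP_SHARED, epfd, 0);

/* the index ring is selected by an offset equal to the header length */
index = mmap(NULL, index_length, PROT_READ | PROT_WRITE,
	     MAP_SHARED, epfd, header_length);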

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 20c94587488f..9ff666ce7cb5 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1276,11 +1276,47 @@ static void ep_show_fdinfo(struct seq_file *m, struct file *f)
 }
 #endif
 
+static int ep_eventpoll_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct eventpoll *ep = vma->vm_file->private_data;
+	size_t size;
+	int rc;
+
+	if (!ep_polled_by_user(ep))
+		return -ENOTSUPP;
+
+	size = vma->vm_end - vma->vm_start;
+	if (!vma->vm_pgoff && size > ep->header_length)
+		return -ENXIO;
+	if (vma->vm_pgoff && ep->header_length != (vma->vm_pgoff << PAGE_SHIFT))
+		/* Index ring starts exactly after the header */
+		return -ENXIO;
+	if (vma->vm_pgoff && size > ep->index_length)
+		return -ENXIO;
+
+	/*
+	 * vm_pgoff is used *only* for indication, what is mapped: user header
+	 * or user index ring.  Sizes are checked above.
+	 */
+	if (!vma->vm_pgoff)
+		rc = remap_vmalloc_range_partial(vma, vma->vm_start,
+						 ep->user_header, size);
+	else
+		rc = remap_vmalloc_range_partial(vma, vma->vm_start,
+						 ep->user_index, size);
+	if (likely(!rc))
+		/* No copies for forks(), please */
+		vma->vm_flags |= VM_DONTCOPY;
+
+	return rc;
+}
+
 /* File callbacks that implement the eventpoll file behaviour */
 static const struct file_operations eventpoll_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= ep_show_fdinfo,
 #endif
+	.mmap		= ep_eventpoll_mmap,
 	.release	= ep_eventpoll_release,
 	.poll		= ep_eventpoll_poll,
 	.llseek		= noop_llseek,
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 13/13] epoll: implement epoll_create2() syscall
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (11 preceding siblings ...)
  2019-05-16  8:58 ` [PATCH v3 12/13] epoll: support mapping for epfd when polled from userspace Roman Penyaev
@ 2019-05-16  8:58 ` Roman Penyaev
  2019-05-16 10:03   ` Arnd Bergmann
  2019-05-31  9:55 ` [PATCH v3 00/13] epoll: support pollable epoll from userspace Peter Zijlstra
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16  8:58 UTC (permalink / raw)
  Cc: Azat Khuzhin, Roman Penyaev, Andrew Morton, Al Viro,
	Linus Torvalds, linux-fsdevel, linux-kernel

epoll_create2() is needed to accept the EPOLL_USERPOLL flag and a size
argument, i.e. this patch wires up polling from userspace.
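
For illustration, a raw usage sketch (untested; there is no libc wrapper
yet, and the syscall number below is the x86 one added by this patch):

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/eventpoll.h>	/* EPOLL_USERPOLL from this series' uapi */

#ifndef __NR_epoll_create2
#define __NR_epoll_create2 428
#endif

/* epfd pollable from userspace, sized for up to 128 items */
int epfd = syscall(__NR_epoll_create2, EPOLL_USERPOLL, 128);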

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cd5f982b1e5..f0d271875f4e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
 427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
+428	i386	epoll_create2		sys_epoll_create2		__ia32_sys_epoll_create2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 64ca0d06259a..5ee9bb31a552 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
 427	common	io_uring_register	__x64_sys_io_uring_register
+428	common	epoll_create2		__x64_sys_epoll_create2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 9ff666ce7cb5..b44c3a0c4ad0 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2614,6 +2614,14 @@ static int do_epoll_create(int flags, size_t size)
 	return error;
 }
 
+SYSCALL_DEFINE2(epoll_create2, int, flags, size_t, size)
+{
+	if (size == 0)
+		return -EINVAL;
+
+	return do_epoll_create(flags, size);
+}
+
 SYSCALL_DEFINE1(epoll_create1, int, flags)
 {
 	return do_epoll_create(flags, 0);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..5049b0d16949 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -358,6 +358,7 @@ asmlinkage long sys_eventfd2(unsigned int count, int flags);
 
 /* fs/eventpoll.c */
 asmlinkage long sys_epoll_create1(int flags);
+asmlinkage long sys_epoll_create2(int flags, size_t size);
 asmlinkage long sys_epoll_ctl(int epfd, int op, int fd,
 				struct epoll_event __user *event);
 asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index dee7292e1df6..fccfaab366ee 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
 #define __NR_io_uring_register 427
 __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_epoll_create2 428
+__SYSCALL(__NR_epoll_create2, sys_epoll_create2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 428
+#define __NR_syscalls 429
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 4d9ae5ea6caf..665908b8a326 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -65,6 +65,7 @@ COND_SYSCALL(eventfd2);
 
 /* fs/eventfd.c */
 COND_SYSCALL(epoll_create1);
+COND_SYSCALL(epoll_create2);
 COND_SYSCALL(epoll_ctl);
 COND_SYSCALL(epoll_pwait);
 COND_SYSCALL_COMPAT(epoll_pwait);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 13/13] epoll: implement epoll_create2() syscall
  2019-05-16  8:58 ` [PATCH v3 13/13] epoll: implement epoll_create2() syscall Roman Penyaev
@ 2019-05-16 10:03   ` Arnd Bergmann
  2019-05-16 10:20     ` Roman Penyaev
  0 siblings, 1 reply; 50+ messages in thread
From: Arnd Bergmann @ 2019-05-16 10:03 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Azat Khuzhin, Andrew Morton, Al Viro, Linus Torvalds,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, May 16, 2019 at 10:59 AM Roman Penyaev <rpenyaev@suse.de> wrote:
>
> epoll_create2() is needed to accept EPOLL_USERPOLL flags
> and size, i.e. this patch wires up polling from userspace.

Could you add the system call to all syscall*.tbl files at the same time here?

        Arnd

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 13/13] epoll: implement epoll_create2() syscall
  2019-05-16 10:03   ` Arnd Bergmann
@ 2019-05-16 10:20     ` Roman Penyaev
  2019-05-16 10:57       ` Arnd Bergmann
  2019-05-22  2:33       ` Andrew Morton
  0 siblings, 2 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-16 10:20 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Azat Khuzhin, Andrew Morton, Al Viro, Linus Torvalds,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On 2019-05-16 12:03, Arnd Bergmann wrote:
> On Thu, May 16, 2019 at 10:59 AM Roman Penyaev <rpenyaev@suse.de> 
> wrote:
>> 
>> epoll_create2() is needed to accept EPOLL_USERPOLL flags
>> and size, i.e. this patch wires up polling from userspace.
> 
> Could you add the system call to all syscall*.tbl files at the same 
> time here?

For all other archs, you mean?  Sure.  But what is the rule of thumb?
Sometimes people add the syscall only to the most common x86 tables and
leave the other tables untouched, and then you commit the rest yourself, e.g.

commit 39036cd2727395c3369b1051005da74059a85317
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Thu Feb 28 13:59:19 2019 +0100

     arch: add pidfd and io_uring syscalls everywhere



--
Roman

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 13/13] epoll: implement epoll_create2() syscall
  2019-05-16 10:20     ` Roman Penyaev
@ 2019-05-16 10:57       ` Arnd Bergmann
  2019-05-22  2:33       ` Andrew Morton
  1 sibling, 0 replies; 50+ messages in thread
From: Arnd Bergmann @ 2019-05-16 10:57 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Azat Khuzhin, Andrew Morton, Al Viro, Linus Torvalds,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, May 16, 2019 at 12:20 PM Roman Penyaev <rpenyaev@suse.de> wrote:
>
> On 2019-05-16 12:03, Arnd Bergmann wrote:
> > On Thu, May 16, 2019 at 10:59 AM Roman Penyaev <rpenyaev@suse.de>
> > wrote:
> >>
> >> epoll_create2() is needed to accept EPOLL_USERPOLL flags
> >> and size, i.e. this patch wires up polling from userspace.
> >
> > Could you add the system call to all syscall*.tbl files at the same
> > time here?
>
> For all other archs, you mean?  Sure.  But what is the rule of thumb?
> Sometimes people tend to add to the most common x86 and other tables
> are left untouched, but then you commit the rest, e.g.
>
> commit 39036cd2727395c3369b1051005da74059a85317
> Author: Arnd Bergmann <arnd@arndb.de>
> Date:   Thu Feb 28 13:59:19 2019 +0100
>
>      arch: add pidfd and io_uring syscalls everywhere

We only recently introduced syscall.tbl files in a common format,
which makes it much easier to add new ones. I hope we can
do it for all architectures right away from now on.

I just noticed that the new mount API assigns six new system
calls as well, but did not use the same numbers across
architectures. I hope we can still rectify that before -rc1
and use the next available ones (428..433), then yours should
be 434 on all architectures, with the exception of arch/alpha.

      Arnd

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 05/13] epoll: offload polling to a work in case of epfd polled from userspace
  2019-05-16  8:58 ` [PATCH v3 05/13] epoll: offload polling to a work in case of epfd polled " Roman Penyaev
@ 2019-05-21  7:51   ` Eric Wong
  2019-05-22 12:54     ` Roman Penyaev
  0 siblings, 1 reply; 50+ messages in thread
From: Eric Wong @ 2019-05-21  7:51 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Azat Khuzhin, Andrew Morton, Al Viro, Linus Torvalds,
	linux-fsdevel, linux-kernel

Roman Penyaev <rpenyaev@suse.de> wrote:
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 81da4571f1e0..9d3905c0afbf 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -44,6 +44,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/compat.h>
>  #include <linux/rculist.h>
> +#include <linux/workqueue.h>
>  #include <net/busy_poll.h>
>  
>  /*
> @@ -185,6 +186,9 @@ struct epitem {
>  
>  	/* The structure that describe the interested events and the source fd */
>  	struct epoll_event event;
> +
> +	/* Work for offloading event callback */
> +	struct work_struct work;
>  };
>  
>  /*

Can we avoid the size regression for existing epoll users?

> @@ -2547,12 +2601,6 @@ static int __init eventpoll_init(void)
>  	ep_nested_calls_init(&poll_safewake_ncalls);
>  #endif
>  
> -	/*
> -	 * We can have many thousands of epitems, so prevent this from
> -	 * using an extra cache line on 64-bit (and smaller) CPUs
> -	 */
> -	BUILD_BUG_ON(sizeof(void *) <= 8 && sizeof(struct epitem) > 128);
> -
>  	/* Allocates slab cache used to allocate "struct epitem" items */
>  	epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
>  			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);

Perhaps a "struct uepitem" transparent union and separate slab cache.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 13/13] epoll: implement epoll_create2() syscall
  2019-05-16 10:20     ` Roman Penyaev
  2019-05-16 10:57       ` Arnd Bergmann
@ 2019-05-22  2:33       ` Andrew Morton
  2019-05-22  9:11         ` Roman Penyaev
  2019-05-22 11:14         ` Arnd Bergmann
  1 sibling, 2 replies; 50+ messages in thread
From: Andrew Morton @ 2019-05-22  2:33 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Arnd Bergmann, Azat Khuzhin, Al Viro, Linus Torvalds,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, 16 May 2019 12:20:50 +0200 Roman Penyaev <rpenyaev@suse.de> wrote:

> On 2019-05-16 12:03, Arnd Bergmann wrote:
> > On Thu, May 16, 2019 at 10:59 AM Roman Penyaev <rpenyaev@suse.de> 
> > wrote:
> >> 
> >> epoll_create2() is needed to accept EPOLL_USERPOLL flags
> >> and size, i.e. this patch wires up polling from userspace.
> > 
> > Could you add the system call to all syscall*.tbl files at the same 
> > time here?
> 
> For all other archs, you mean?  Sure.  But what is the rule of thumb?
> Sometimes people tend to add to the most common x86 and other tables
> are left untouched, but then you commit the rest, e.g.
> 
> commit 39036cd2727395c3369b1051005da74059a85317
> Author: Arnd Bergmann <arnd@arndb.de>
> Date:   Thu Feb 28 13:59:19 2019 +0100
> 
>      arch: add pidfd and io_uring syscalls everywhere
> 

I thought the preferred approach was to wire up the architectures on
which the submitter has tested the syscall, then allow the arch
maintainers to enable the syscall independently?

And to help them in this, provide a test suite for the new syscall
under tools/testing/selftests/.

https://github.com/rouming/test-tools/blob/master/userpolled-epoll.c
will certainly help but I do think it would be better to move the test
into the kernel tree to keep it maintained and so that many people run
it in their various setups on an ongoing basis.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 13/13] epoll: implement epoll_create2() syscall
  2019-05-22  2:33       ` Andrew Morton
@ 2019-05-22  9:11         ` Roman Penyaev
  2019-05-22 11:14         ` Arnd Bergmann
  1 sibling, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-22  9:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Arnd Bergmann, Azat Khuzhin, Al Viro, Linus Torvalds,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On 2019-05-22 04:33, Andrew Morton wrote:
> On Thu, 16 May 2019 12:20:50 +0200 Roman Penyaev <rpenyaev@suse.de> 
> wrote:
> 
>> On 2019-05-16 12:03, Arnd Bergmann wrote:
>> > On Thu, May 16, 2019 at 10:59 AM Roman Penyaev <rpenyaev@suse.de>
>> > wrote:
>> >>
>> >> epoll_create2() is needed to accept EPOLL_USERPOLL flags
>> >> and size, i.e. this patch wires up polling from userspace.
>> >
>> > Could you add the system call to all syscall*.tbl files at the same
>> > time here?
>> 
>> For all other archs, you mean?  Sure.  But what is the rule of thumb?
>> Sometimes people tend to add to the most common x86 and other tables
>> are left untouched, but then you commit the rest, e.g.
>> 
>> commit 39036cd2727395c3369b1051005da74059a85317
>> Author: Arnd Bergmann <arnd@arndb.de>
>> Date:   Thu Feb 28 13:59:19 2019 +0100
>> 
>>      arch: add pidfd and io_uring syscalls everywhere
>> 
> 
> I thought the preferred approach was to wire up the architectures on
> which the submitter has tested the syscall, then allow the arch
> maintainers to enable the syscall independently?
> 
> And to help them in this, provide a test suite for the new syscall
> under tools/testing/selftests/.
> 
> https://github.com/rouming/test-tools/blob/master/userpolled-epoll.c
> will certainly help but I do think it would be better to move the test
> into the kernel tree to keep it maintained and so that many people run
> it in their various setups on an ongoing basis.

Yes, on a next iteration I will add the tool to selftests. Thanks.

--
Roman



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 13/13] epoll: implement epoll_create2() syscall
  2019-05-22  2:33       ` Andrew Morton
  2019-05-22  9:11         ` Roman Penyaev
@ 2019-05-22 11:14         ` Arnd Bergmann
  2019-05-22 18:36           ` Andrew Morton
  1 sibling, 1 reply; 50+ messages in thread
From: Arnd Bergmann @ 2019-05-22 11:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Penyaev, Azat Khuzhin, Al Viro, Linus Torvalds,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, May 22, 2019 at 4:33 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 16 May 2019 12:20:50 +0200 Roman Penyaev <rpenyaev@suse.de> wrote:
> > On 2019-05-16 12:03, Arnd Bergmann wrote:
> > > On Thu, May 16, 2019 at 10:59 AM Roman Penyaev <rpenyaev@suse.de>
> > > wrote:
> > >>
> > >> epoll_create2() is needed to accept EPOLL_USERPOLL flags
> > >> and size, i.e. this patch wires up polling from userspace.
> > >
> > > Could you add the system call to all syscall*.tbl files at the same
> > > time here?
> >
> > For all other archs, you mean?  Sure.  But what is the rule of thumb?
> > Sometimes people tend to add to the most common x86 and other tables
> > are left untouched, but then you commit the rest, e.g.
> >
> > commit 39036cd2727395c3369b1051005da74059a85317
> > Author: Arnd Bergmann <arnd@arndb.de>
> > Date:   Thu Feb 28 13:59:19 2019 +0100
> >
> >      arch: add pidfd and io_uring syscalls everywhere
> >
>
> I thought the preferred approach was to wire up the architectures on
> which the submitter has tested the syscall, then allow the arch
> maintainers to enable the syscall independently?

I'm hoping to change that practice now, as it has not worked well
in the past:

- half the architectures now use asm-generic/unistd.h, so they are
  already wired up at the same time, regardless of testing
- in the other half, not adding them at the same time actually
  made it harder to test, as it was significantly harder to figure
  out how to build a modified kernel for a given architecture
  than to run the test case
- Not having all architectures add a new call at the same time caused
  the architectures to get out of sync when some got added and others
  did not. Now that we use the same numbers across all architectures,
  that would be even more confusing.

My plan for the long run is to only have one file to which new
system calls get added in the future.

> And to help them in this, provide a test suite for the new syscall
> under tools/testing/selftests/.
>
> https://github.com/rouming/test-tools/blob/master/userpolled-epoll.c
> will certainly help but I do think it would be better to move the test
> into the kernel tree to keep it maintained and so that many people run
> it in their various setups on an ongoing basis.

No disagreement on that.

      Arnd

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 05/13] epoll: offload polling to a work in case of epfd polled from userspace
  2019-05-21  7:51   ` Eric Wong
@ 2019-05-22 12:54     ` Roman Penyaev
  0 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-22 12:54 UTC (permalink / raw)
  To: Eric Wong
  Cc: Azat Khuzhin, Andrew Morton, Al Viro, Linus Torvalds,
	linux-fsdevel, linux-kernel

On 2019-05-21 09:51, Eric Wong wrote:
> Roman Penyaev <rpenyaev@suse.de> wrote:
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index 81da4571f1e0..9d3905c0afbf 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -44,6 +44,7 @@
>>  #include <linux/seq_file.h>
>>  #include <linux/compat.h>
>>  #include <linux/rculist.h>
>> +#include <linux/workqueue.h>
>>  #include <net/busy_poll.h>
>> 
>>  /*
>> @@ -185,6 +186,9 @@ struct epitem {
>> 
>>  	/* The structure that describe the interested events and the source 
>> fd */
>>  	struct epoll_event event;
>> +
>> +	/* Work for offloading event callback */
>> +	struct work_struct work;
>>  };
>> 
>>  /*
> 
> Can we avoid the size regression for existing epoll users?

Yeah, ->next, ->rdllink and ->ws are not used for user polling, so they
can hold the work struct and friends (or a pointer to a container which
holds the work struct, if the work struct turns out to be too big).
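
Roughly something like this (untested, purely illustrative, ignores the
exact field order of the current struct epitem):

	union {
		/* used only by "normal" epoll */
		struct {
			struct epitem *next;
			struct list_head rdllink;
			struct wakeup_source __rcu *ws;
		};
		/* used only when epfd is polled from userspace */
		struct work_struct work;
	};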

--
Roman


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 13/13] epoll: implement epoll_create2() syscall
  2019-05-22 11:14         ` Arnd Bergmann
@ 2019-05-22 18:36           ` Andrew Morton
  0 siblings, 0 replies; 50+ messages in thread
From: Andrew Morton @ 2019-05-22 18:36 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Roman Penyaev, Azat Khuzhin, Al Viro, Linus Torvalds,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, 22 May 2019 13:14:41 +0200 Arnd Bergmann <arnd@arndb.de> wrote:

> > I thought the preferred approach was to wire up the architectures on
> > which the submitter has tested the syscall, then allow the arch
> > maintainers to enable the syscall independently?
> 
> I'm hoping to change that practice now, as it has not worked well
> in the past:
> 
> - half the architectures now use asm-generic/unistd.h, so they are
>   already wired up at the same time, regardless of testing
> - in the other half, not adding them at the same time actually
>   made it harder to test, as it was significantly harder to figure
>   out how to build a modified kernel for a given architecture
>   than to run the test case
> - Not having all architectures add a new call at the same time caused
>   the architectures to get out of sync when some got added and others
>   did not. Now that we use the same numbers across all architectures,
>   that would be even more confusing.
>
> My plan for the long run is to only have one file to which new
> system calls get added in the future.

Fair enough.  We're adding code to architectures without having tested
it on those architectures but we do that all the time anyway - I guess
there's not a lot of point in special-casing new syscalls.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (12 preceding siblings ...)
  2019-05-16  8:58 ` [PATCH v3 13/13] epoll: implement epoll_create2() syscall Roman Penyaev
@ 2019-05-31  9:55 ` Peter Zijlstra
  2019-05-31 14:48 ` Jens Axboe
  2019-05-31 16:33 ` Peter Zijlstra
  15 siblings, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2019-05-31  9:55 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On Thu, May 16, 2019 at 10:57:57AM +0200, Roman Penyaev wrote:
> When new event comes for some epoll item kernel does the following:
> 
>  struct epoll_uitem *uitem;
> 
>  /* Each item has a bit (index in user items array), discussed later */
>  uitem = user_header->items[epi->bit];
> 
>  if (!atomic_fetch_or(uitem->ready_events, pollflags)) {
>      i = atomic_add(&ep->user_header->tail, 1);
> 
>      item_idx = &user_index[i & index_mask];
> 
>      /* Signal with a bit, user spins on index expecting value > 0 */
>      *item_idx = idx + 1;
> 
>     /*
>      * Want index update be flushed from CPU write buffer and
>      * immediately visible on userspace side to avoid long busy
>      * loops.
>      */

That is garbage; smp_wmb() does no such thing.

>      smp_wmb();
>  }

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-16  8:58 ` [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring Roman Penyaev
@ 2019-05-31  9:55   ` Peter Zijlstra
  2019-05-31 11:24     ` Roman Penyaev
  2019-05-31  9:56   ` Peter Zijlstra
  1 sibling, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2019-05-31  9:55 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On Thu, May 16, 2019 at 10:58:03AM +0200, Roman Penyaev wrote:
> +#define atomic_set_unless_zero(ptr, flags)			\
> +({								\
> +	typeof(ptr) _ptr = (ptr);				\
> +	typeof(flags) _flags = (flags);				\
> +	typeof(*_ptr) _old, _val = READ_ONCE(*_ptr);		\
> +								\
> +	for (;;) {						\
> +		if (!_val)					\
> +			break;					\
> +		_old = cmpxchg(_ptr, _val, _flags);		\
> +		if (_old == _val)				\
> +			break;					\
> +		_val = _old;					\
> +	}							\
> +	_val;							\
> +})

> +#define atomic_or_with_mask(ptr, flags, mask)			\
> +({								\
> +	typeof(ptr) _ptr = (ptr);				\
> +	typeof(flags) _flags = (flags);				\
> +	typeof(flags) _mask = (mask);				\
> +	typeof(*_ptr) _old, _new, _val = READ_ONCE(*_ptr);	\
> +								\
> +	for (;;) {						\
> +		_new = (_val & ~_mask) | _flags;		\
> +		_old = cmpxchg(_ptr, _val, _new);		\
> +		if (_old == _val)				\
> +			break;					\
> +		_val = _old;					\
> +	}							\
> +	_val;							\
> +})

Don't call them atomic_*() if they're not part of the atomic_t
interface.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-16  8:58 ` [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring Roman Penyaev
  2019-05-31  9:55   ` Peter Zijlstra
@ 2019-05-31  9:56   ` Peter Zijlstra
  2019-05-31 11:15     ` Roman Penyaev
  1 sibling, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2019-05-31  9:56 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On Thu, May 16, 2019 at 10:58:03AM +0200, Roman Penyaev wrote:
> +static inline bool ep_add_event_to_uring(struct epitem *epi, __poll_t pollflags)
> +{
> +	struct eventpoll *ep = epi->ep;
> +	struct epoll_uitem *uitem;
> +	bool added = false;
> +
> +	if (WARN_ON(!pollflags))
> +		return false;
> +
> +	uitem = &ep->user_header->items[epi->bit];
> +	/*
> +	 * Can be represented as:
> +	 *
> +	 *    was_ready = uitem->ready_events;
> +	 *    uitem->ready_events &= ~EPOLLREMOVED;
> +	 *    uitem->ready_events |= pollflags;
> +	 *    if (!was_ready) {
> +	 *         // create index entry
> +	 *    }
> +	 *
> +	 * See the big comment inside ep_remove_user_item(), why it is
> +	 * important to mask EPOLLREMOVED.
> +	 */
> +	if (!atomic_or_with_mask(&uitem->ready_events,
> +				 pollflags, EPOLLREMOVED)) {
> +		unsigned int i, *item_idx, index_mask;
> +
> +		/*
> +		 * Item was not ready before, thus we have to insert
> +		 * new index to the ring.
> +		 */
> +
> +		index_mask = ep_max_index_nr(ep) - 1;
> +		i = __atomic_fetch_add(&ep->user_header->tail, 1,
> +				       __ATOMIC_ACQUIRE);

afaict __atomic_fetch_add() does not exist.

> +		item_idx = &ep->user_index[i & index_mask];
> +
> +		/* Signal with a bit, which is > 0 */
> +		*item_idx = epi->bit + 1;

Did you just increment the user visible tail pointer before you filled
the data? That is, can the concurrent userspace observe the increment
before you put credible data in its place?

> +
> +		/*
> +		 * Want index update be flushed from CPU write buffer and
> +		 * immediately visible on userspace side to avoid long busy
> +		 * loops.
> +		 */
> +		smp_wmb();

That's still complete nonsense.

> +
> +		added = true;
> +	}
> +
> +	return added;
> +}

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback()
  2019-05-16  8:58 ` [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback() Roman Penyaev
@ 2019-05-31  9:56   ` Peter Zijlstra
  2019-05-31 11:22     ` Roman Penyaev
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2019-05-31  9:56 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On Thu, May 16, 2019 at 10:58:04AM +0200, Roman Penyaev wrote:
> Each ep_poll_callback() is called when fd calls wakeup() on epfd.
> So account new event in user ring.
> 
> The tricky part here is EPOLLONESHOT.  Since we are lockless we
> have to be deal with ep_poll_callbacks() called in paralle, thus
> use cmpxchg to clear public event bits and filter out concurrent
> call from another cpu.
> 
> Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 2f551c005640..55612da9651e 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1407,6 +1407,29 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
>  }
>  #endif /* CONFIG_CHECKPOINT_RESTORE */
>  
> +/**
> + * Atomically clear public event bits and return %true if the old value has
> + * public event bits set.
> + */
> +static inline bool ep_clear_public_event_bits(struct epitem *epi)
> +{
> +	__poll_t old, flags;
> +
> +	/*
> +	 * Here we race with ourselves and with ep_modify(), which can
> +	 * change the event bits.  In order not to override events updated
> +	 * by ep_modify() we have to do cmpxchg.
> +	 */
> +
> +	old = epi->event.events;
> +	do {
> +		flags = old;
> +	} while ((old = cmpxchg(&epi->event.events, flags,
> +				flags & EP_PRIVATE_BITS)) != flags);
> +
> +	return flags & ~EP_PRIVATE_BITS;
> +}

AFAICT epi->event.events also has normal writes to it, eg. in
ep_modify(). A number of architectures cannot handle concurrent normal
writes and cmpxchg() to the same variable.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-31  9:56   ` Peter Zijlstra
@ 2019-05-31 11:15     ` Roman Penyaev
  2019-05-31 12:53       ` Peter Zijlstra
  2019-05-31 12:56       ` Peter Zijlstra
  0 siblings, 2 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-31 11:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On 2019-05-31 11:56, Peter Zijlstra wrote:
> On Thu, May 16, 2019 at 10:58:03AM +0200, Roman Penyaev wrote:
>> +static inline bool ep_add_event_to_uring(struct epitem *epi, __poll_t 
>> pollflags)
>> +{
>> +	struct eventpoll *ep = epi->ep;
>> +	struct epoll_uitem *uitem;
>> +	bool added = false;
>> +
>> +	if (WARN_ON(!pollflags))
>> +		return false;
>> +
>> +	uitem = &ep->user_header->items[epi->bit];
>> +	/*
>> +	 * Can be represented as:
>> +	 *
>> +	 *    was_ready = uitem->ready_events;
>> +	 *    uitem->ready_events &= ~EPOLLREMOVED;
>> +	 *    uitem->ready_events |= pollflags;
>> +	 *    if (!was_ready) {
>> +	 *         // create index entry
>> +	 *    }
>> +	 *
>> +	 * See the big comment inside ep_remove_user_item(), why it is
>> +	 * important to mask EPOLLREMOVED.
>> +	 */
>> +	if (!atomic_or_with_mask(&uitem->ready_events,
>> +				 pollflags, EPOLLREMOVED)) {
>> +		unsigned int i, *item_idx, index_mask;
>> +
>> +		/*
>> +		 * Item was not ready before, thus we have to insert
>> +		 * new index to the ring.
>> +		 */
>> +
>> +		index_mask = ep_max_index_nr(ep) - 1;
>> +		i = __atomic_fetch_add(&ep->user_header->tail, 1,
>> +				       __ATOMIC_ACQUIRE);
> 
> afaict __atomic_fetch_add() does not exist.

That is a gcc extension.  I did not find any API to just increment a
variable atomically without using/casting to atomic_t.  What is the
proper way to achieve that?

> 
>> +		item_idx = &ep->user_index[i & index_mask];
>> +
>> +		/* Signal with a bit, which is > 0 */
>> +		*item_idx = epi->bit + 1;
> 
> Did you just increment the user visible tail pointer before you filled
> the data? That is, can the concurrent userspace observe the increment
> before you put credible data in its place?

No, the "data" is the "ready_events" mask, which was updated before,
using cmpxchg, atomic_or_with_mask() call.  All I need is to put an
index of just updated item to the uring.

Userspace, in its turn, gets the index from the ring and then checks
the mask.
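
For reference, the consumer side then does roughly the following
(simplified and untested; declarations are omitted, handle_event() and
index_mask are only illustrative, and details such as EPOLLREMOVED
handling are left out):

	head = header->head;	/* only userspace advances head */
	tail = __atomic_load_n(&header->tail, __ATOMIC_ACQUIRE);
	while (head != tail) {
		/* kernel writes (item bit + 1) into the index ring */
		idx = index[head++ & index_mask];
		item = &header->items[idx - 1];

		/* fetch and clear the mask, so the next event creates
		 * a new index entry */
		events = __atomic_exchange_n(&item->ready_events, 0,
					     __ATOMIC_ACQ_REL);
		if (events)
			handle_event(item->data, events);
	}
	/* publish the new consumer position */
	__atomic_store_n(&header->head, head, __ATOMIC_RELEASE);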

> 
>> +
>> +		/*
>> +		 * Want index update be flushed from CPU write buffer and
>> +		 * immediately visible on userspace side to avoid long busy
>> +		 * loops.
>> +		 */
>> +		smp_wmb();
> 
> That's still complete nonsense.

Yes, true.  My confusion came from a simple test, where one thread
swaps pointers in a loop and another thread dereferences the pointer
and increments a variable:

     /* shared between both threads */
     unsigned vvv1 = 0, vvv2 = 0;
     unsigned *thr_level2;

     THR#0
     -----------

     unsigned *ptr;

     ptr = &vvv1;
     thr_level2 = &vvv2;

     while (!stop) {
         unsigned *tmp = thr_level2;
         thr_level2 = ptr;
         barrier();               <<<< ????
         ptr = tmp;
     }

     THR#1
     -----------

     unsigned *ptr;

     while (!stop) {
         ptr = thr_level2;
         (*ptr)++;
     }


In the end I expect `vvv1` and `vvv2` to be incremented approximately
equally.  But without barrier() only one of the variables gets
incremented.

Now I see that barrier() should be defined as a simple compiler barrier,
i.e. asm volatile("" ::: "memory"), and that there is nothing related to
the CPU write buffer, contrary to what I wrote in the comment.

So the comment is indeed garbage and can be removed.  Thanks.

--
Roman


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback()
  2019-05-31  9:56   ` Peter Zijlstra
@ 2019-05-31 11:22     ` Roman Penyaev
  2019-05-31 13:05       ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Roman Penyaev @ 2019-05-31 11:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On 2019-05-31 11:56, Peter Zijlstra wrote:
> On Thu, May 16, 2019 at 10:58:04AM +0200, Roman Penyaev wrote:
>> Each ep_poll_callback() is called when fd calls wakeup() on epfd.
>> So account new event in user ring.
>> 
>> The tricky part here is EPOLLONESHOT.  Since we are lockless we
>> have to be deal with ep_poll_callbacks() called in paralle, thus
>> use cmpxchg to clear public event bits and filter out concurrent
>> call from another cpu.
>> 
>> Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> 
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index 2f551c005640..55612da9651e 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -1407,6 +1407,29 @@ struct file *get_epoll_tfile_raw_ptr(struct 
>> file *file, int tfd,
>>  }
>>  #endif /* CONFIG_CHECKPOINT_RESTORE */
>> 
>> +/**
>> + * Atomically clear public event bits and return %true if the old 
>> value has
>> + * public event bits set.
>> + */
>> +static inline bool ep_clear_public_event_bits(struct epitem *epi)
>> +{
>> +	__poll_t old, flags;
>> +
>> +	/*
>> +	 * Here we race with ourselves and with ep_modify(), which can
>> +	 * change the event bits.  In order not to override events updated
>> +	 * by ep_modify() we have to do cmpxchg.
>> +	 */
>> +
>> +	old = epi->event.events;
>> +	do {
>> +		flags = old;
>> +	} while ((old = cmpxchg(&epi->event.events, flags,
>> +				flags & EP_PRIVATE_BITS)) != flags);
>> +
>> +	return flags & ~EP_PRIVATE_BITS;
>> +}
> 
> AFAICT epi->event.events also has normal writes to it, eg. in
> ep_modify(). A number of architectures cannot handle concurrent normal
> writes and cmpxchg() to the same variable.

Yes, we race with the current function and with ep_modify().  Then
ep_modify() should do something like the following:

-	epi->event.events = event->events;
+	xchg(&epi->event.events, event->events);

Is that ok?

Just curious: what are these archs?

Thanks.

--
Roman




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-31  9:55   ` Peter Zijlstra
@ 2019-05-31 11:24     ` Roman Penyaev
  2019-05-31 13:11       ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Roman Penyaev @ 2019-05-31 11:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On 2019-05-31 11:55, Peter Zijlstra wrote:
> On Thu, May 16, 2019 at 10:58:03AM +0200, Roman Penyaev wrote:
>> +#define atomic_set_unless_zero(ptr, flags)			\
>> +({								\
>> +	typeof(ptr) _ptr = (ptr);				\
>> +	typeof(flags) _flags = (flags);				\
>> +	typeof(*_ptr) _old, _val = READ_ONCE(*_ptr);		\
>> +								\
>> +	for (;;) {						\
>> +		if (!_val)					\
>> +			break;					\
>> +		_old = cmpxchg(_ptr, _val, _flags);		\
>> +		if (_old == _val)				\
>> +			break;					\
>> +		_val = _old;					\
>> +	}							\
>> +	_val;							\
>> +})
> 
>> +#define atomic_or_with_mask(ptr, flags, mask)			\
>> +({								\
>> +	typeof(ptr) _ptr = (ptr);				\
>> +	typeof(flags) _flags = (flags);				\
>> +	typeof(flags) _mask = (mask);				\
>> +	typeof(*_ptr) _old, _new, _val = READ_ONCE(*_ptr);	\
>> +								\
>> +	for (;;) {						\
>> +		_new = (_val & ~_mask) | _flags;		\
>> +		_old = cmpxchg(_ptr, _val, _new);		\
>> +		if (_old == _val)				\
>> +			break;					\
>> +		_val = _old;					\
>> +	}							\
>> +	_val;							\
>> +})
> 
> Don't call them atomic_*() if they're not part of the atomic_t
> interface.

Can we add those two?  Or keep it local is better?

--
Roman




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-31 11:15     ` Roman Penyaev
@ 2019-05-31 12:53       ` Peter Zijlstra
  2019-05-31 14:28         ` Roman Penyaev
  2019-05-31 12:56       ` Peter Zijlstra
  1 sibling, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2019-05-31 12:53 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On Fri, May 31, 2019 at 01:15:21PM +0200, Roman Penyaev wrote:
> On 2019-05-31 11:56, Peter Zijlstra wrote:
> > On Thu, May 16, 2019 at 10:58:03AM +0200, Roman Penyaev wrote:

> > > +		i = __atomic_fetch_add(&ep->user_header->tail, 1,
> > > +				       __ATOMIC_ACQUIRE);
> > 
> > afaict __atomic_fetch_add() does not exist.
> 
> That is gcc extension.  I did not find any API just to increment
> the variable atomically without using/casting to atomic.  What
> is a proper way to achieve that?

That's C11 atomics, and those shall not be used in the kernel. For one
they're not available in the minimally required GCC version (4.6).

The proper and only way is to use atomic_t, but also you cannot share
atomic_t with userspace.

The normal way of doing something like this is to have a kernel private
atomic_t and copy the value out to userspace using smp_store_release().

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-31 11:15     ` Roman Penyaev
  2019-05-31 12:53       ` Peter Zijlstra
@ 2019-05-31 12:56       ` Peter Zijlstra
  2019-05-31 14:21         ` Roman Penyaev
  1 sibling, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2019-05-31 12:56 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On Fri, May 31, 2019 at 01:15:21PM +0200, Roman Penyaev wrote:
> On 2019-05-31 11:56, Peter Zijlstra wrote:
> > On Thu, May 16, 2019 at 10:58:03AM +0200, Roman Penyaev wrote:
> > > +static inline bool ep_add_event_to_uring(struct epitem *epi,
> > > __poll_t pollflags)
> > > +{
> > > +	struct eventpoll *ep = epi->ep;
> > > +	struct epoll_uitem *uitem;
> > > +	bool added = false;
> > > +
> > > +	if (WARN_ON(!pollflags))
> > > +		return false;
> > > +
> > > +	uitem = &ep->user_header->items[epi->bit];
> > > +	/*
> > > +	 * Can be represented as:
> > > +	 *
> > > +	 *    was_ready = uitem->ready_events;
> > > +	 *    uitem->ready_events &= ~EPOLLREMOVED;
> > > +	 *    uitem->ready_events |= pollflags;
> > > +	 *    if (!was_ready) {
> > > +	 *         // create index entry
> > > +	 *    }
> > > +	 *
> > > +	 * See the big comment inside ep_remove_user_item(), why it is
> > > +	 * important to mask EPOLLREMOVED.
> > > +	 */
> > > +	if (!atomic_or_with_mask(&uitem->ready_events,
> > > +				 pollflags, EPOLLREMOVED)) {
> > > +		unsigned int i, *item_idx, index_mask;
> > > +
> > > +		/*
> > > +		 * Item was not ready before, thus we have to insert
> > > +		 * new index to the ring.
> > > +		 */
> > > +
> > > +		index_mask = ep_max_index_nr(ep) - 1;
> > > +		i = __atomic_fetch_add(&ep->user_header->tail, 1,
> > > +				       __ATOMIC_ACQUIRE);
> > > +		item_idx = &ep->user_index[i & index_mask];
> > > +
> > > +		/* Signal with a bit, which is > 0 */
> > > +		*item_idx = epi->bit + 1;
> > 
> > Did you just increment the user visible tail pointer before you filled
> > the data? That is, can the concurrent userspace observe the increment
> > before you put credible data in its place?
> 
> No, the "data" is the "ready_events" mask, which was updated before,
> using cmpxchg, atomic_or_with_mask() call.  All I need is to put an
> index of just updated item to the uring.
> 
> Userspace, in its turn, gets the index from the ring and then checks
> the mask.

But where do you write the index into the shared memory? That index
should be written before you publish the new tail.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback()
  2019-05-31 11:22     ` Roman Penyaev
@ 2019-05-31 13:05       ` Peter Zijlstra
  2019-05-31 15:05         ` Roman Penyaev
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2019-05-31 13:05 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On Fri, May 31, 2019 at 01:22:54PM +0200, Roman Penyaev wrote:
> On 2019-05-31 11:56, Peter Zijlstra wrote:
> > On Thu, May 16, 2019 at 10:58:04AM +0200, Roman Penyaev wrote:

> > > +static inline bool ep_clear_public_event_bits(struct epitem *epi)
> > > +{
> > > +	__poll_t old, flags;
> > > +
> > > +	/*
> > > +	 * Here we race with ourselves and with ep_modify(), which can
> > > +	 * change the event bits.  In order not to override events updated
> > > +	 * by ep_modify() we have to do cmpxchg.
> > > +	 */
> > > +
> > > +	old = epi->event.events;
> > > +	do {
> > > +		flags = old;
> > > +	} while ((old = cmpxchg(&epi->event.events, flags,
> > > +				flags & EP_PRIVATE_BITS)) != flags);
> > > +
> > > +	return flags & ~EP_PRIVATE_BITS;
> > > +}
> > 
> > AFAICT epi->event.events also has normal writes to it, eg. in
> > ep_modify(). A number of architectures cannot handle concurrent normal
> > writes and cmpxchg() to the same variable.
> 
> Yes, we race with the current function and with ep_modify().  Then,
> ep_modify()
> should do something as the following:
> 
> -	epi->event.events = event->events
> +	xchg(&epi->event.events, event->events);
> 
> Is that ok?

That should be correct, but at that point I think we should also always
read the thing with READ_ONCE() to avoid load-tearing. And I suspect it
then becomes sensible to change the type to atomic_t.

atomic_set() vs atomic_cmpxchg() only carries the extra overhead on
those 'dodgy' platforms.
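
E.g. roughly (sketch only, assuming the field is converted to atomic_t):

	/* ep_modify() */
	atomic_set(&epi->event.events, event->events);

	/* ep_clear_public_event_bits() */
	old = atomic_read(&epi->event.events);
	do {
		flags = old;
		old = atomic_cmpxchg(&epi->event.events, flags,
				     flags & EP_PRIVATE_BITS);
	} while (old != flags);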

> Just curious: what are these archs?

Oh, lovely stuff like parisc, sparc32 and arc-eznps. See
arch/parisc/lib/bitops.c:__cmpxchg_*() for example :/ Those systems only
have a single truly atomic op (something from the xchg / test-and-set
family) and the rest is fudged on top of that.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-31 11:24     ` Roman Penyaev
@ 2019-05-31 13:11       ` Peter Zijlstra
  0 siblings, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2019-05-31 13:11 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On Fri, May 31, 2019 at 01:24:07PM +0200, Roman Penyaev wrote:
> On 2019-05-31 11:55, Peter Zijlstra wrote:
> > On Thu, May 16, 2019 at 10:58:03AM +0200, Roman Penyaev wrote:
> > > +#define atomic_set_unless_zero(ptr, flags)			\
> > > +({								\
> > > +	typeof(ptr) _ptr = (ptr);				\
> > > +	typeof(flags) _flags = (flags);				\
> > > +	typeof(*_ptr) _old, _val = READ_ONCE(*_ptr);		\
> > > +								\
> > > +	for (;;) {						\
> > > +		if (!_val)					\
> > > +			break;					\
> > > +		_old = cmpxchg(_ptr, _val, _flags);		\
> > > +		if (_old == _val)				\
> > > +			break;					\
> > > +		_val = _old;					\
> > > +	}							\
> > > +	_val;							\
> > > +})
> > 
> > > +#define atomic_or_with_mask(ptr, flags, mask)			\
> > > +({								\
> > > +	typeof(ptr) _ptr = (ptr);				\
> > > +	typeof(flags) _flags = (flags);				\
> > > +	typeof(flags) _mask = (mask);				\
> > > +	typeof(*_ptr) _old, _new, _val = READ_ONCE(*_ptr);	\
> > > +								\
> > > +	for (;;) {						\
> > > +		_new = (_val & ~_mask) | _flags;		\
> > > +		_old = cmpxchg(_ptr, _val, _new);		\
> > > +		if (_old == _val)				\
> > > +			break;					\
> > > +		_val = _old;					\
> > > +	}							\
> > > +	_val;							\
> > > +})
> > 
> > Don't call them atomic_*() if they're not part of the atomic_t
> > interface.
> 
> Can we add those two?  Or keep it local is better?

Afaict you're using them on the user visible values; we should not put
atomic_t into shared memory.

Your interface isn't compatible with those 'funny' architectures like
parisc etc., since you expect userspace to do atomic ops on these
variables too. It is not a one-way interface ...

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-31 12:56       ` Peter Zijlstra
@ 2019-05-31 14:21         ` Roman Penyaev
  2019-05-31 16:51           ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Roman Penyaev @ 2019-05-31 14:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On 2019-05-31 14:56, Peter Zijlstra wrote:
> On Fri, May 31, 2019 at 01:15:21PM +0200, Roman Penyaev wrote:
>> On 2019-05-31 11:56, Peter Zijlstra wrote:
>> > On Thu, May 16, 2019 at 10:58:03AM +0200, Roman Penyaev wrote:
>> > > +static inline bool ep_add_event_to_uring(struct epitem *epi,
>> > > __poll_t pollflags)
>> > > +{
>> > > +	struct eventpoll *ep = epi->ep;
>> > > +	struct epoll_uitem *uitem;
>> > > +	bool added = false;
>> > > +
>> > > +	if (WARN_ON(!pollflags))
>> > > +		return false;
>> > > +
>> > > +	uitem = &ep->user_header->items[epi->bit];
>> > > +	/*
>> > > +	 * Can be represented as:
>> > > +	 *
>> > > +	 *    was_ready = uitem->ready_events;
>> > > +	 *    uitem->ready_events &= ~EPOLLREMOVED;
>> > > +	 *    uitem->ready_events |= pollflags;
>> > > +	 *    if (!was_ready) {
>> > > +	 *         // create index entry
>> > > +	 *    }
>> > > +	 *
>> > > +	 * See the big comment inside ep_remove_user_item(), why it is
>> > > +	 * important to mask EPOLLREMOVED.
>> > > +	 */
>> > > +	if (!atomic_or_with_mask(&uitem->ready_events,
>> > > +				 pollflags, EPOLLREMOVED)) {
>> > > +		unsigned int i, *item_idx, index_mask;
>> > > +
>> > > +		/*
>> > > +		 * Item was not ready before, thus we have to insert
>> > > +		 * new index to the ring.
>> > > +		 */
>> > > +
>> > > +		index_mask = ep_max_index_nr(ep) - 1;
>> > > +		i = __atomic_fetch_add(&ep->user_header->tail, 1,
>> > > +				       __ATOMIC_ACQUIRE);
>> > > +		item_idx = &ep->user_index[i & index_mask];
>> > > +
>> > > +		/* Signal with a bit, which is > 0 */
>> > > +		*item_idx = epi->bit + 1;
>> >
>> > Did you just increment the user visible tail pointer before you filled
>> > the data? That is, can the concurrent userspace observe the increment
>> > before you put credible data in its place?
>> 
>> No, the "data" is the "ready_events" mask, which was updated before,
>> using cmpxchg, atomic_or_with_mask() call.  All I need is to put an
>> index of just updated item to the uring.
>> 
>> Userspace, in its turn, gets the index from the ring and then checks
>> the mask.
> 
> But where do you write the index into the shared memory? That index
> should be written before you publish the new tail.

The ep_add_event_to_uring() path is lockless, thus I can't increase the
tail afterwards; I need to reserve the index slot to write to first.  I
can use a shadow tail, which is not seen by userspace, but then I have
to guarantee that the tail is updated with the shadow tail *after* all
callers of ep_add_event_to_uring() have left.  That is possible, please
see the code below, but it adds more complexity:

(code was tested on user side, thus has c11 atomics)

static inline void add_event__kernel(struct ring *ring, unsigned bit)
{
         unsigned i, cntr, commit_cntr, *item_idx, tail, old;

         i = __atomic_fetch_add(&ring->cntr, 1, __ATOMIC_ACQUIRE);
         item_idx = &ring->user_itemsindex[i % ring->nr];

         /* Update data */
         *item_idx = bit;

         commit_cntr = __atomic_add_fetch(&ring->commit_cntr, 1, 
__ATOMIC_RELEASE);

         tail = ring->user_header->tail;
         rmb();
         do {
                 cntr = ring->cntr;
                 if (cntr != commit_cntr)
                         /* Someone else will advance tail */
                         break;

                 old = tail;

         } while ((tail = 
__sync_val_compare_and_swap(&ring->user_header->tail, old, cntr)) != 
old);
}

Another way (the current solution) is to spin on the userspace side in
order to get an index > 0 (a valid index is always > 0), i.e.:

	item_idx_ptr = &index[idx & indices_mask];

	/*
	 * Spin here till we see valid index
	 */
	while (!(idx = __atomic_load_n(item_idx_ptr, __ATOMIC_ACQUIRE)))
		;



So of course the tail can be updated afterwards, as you mentioned, but
then I have to introduce locks.  I want to keep the hot event path
lockless.

--
Roman



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-31 12:53       ` Peter Zijlstra
@ 2019-05-31 14:28         ` Roman Penyaev
  2019-05-31 16:53           ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Roman Penyaev @ 2019-05-31 14:28 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On 2019-05-31 14:53, Peter Zijlstra wrote:
> On Fri, May 31, 2019 at 01:15:21PM +0200, Roman Penyaev wrote:
>> On 2019-05-31 11:56, Peter Zijlstra wrote:
>> > On Thu, May 16, 2019 at 10:58:03AM +0200, Roman Penyaev wrote:
> 
>> > > +		i = __atomic_fetch_add(&ep->user_header->tail, 1,
>> > > +				       __ATOMIC_ACQUIRE);
>> >
>> > afaict __atomic_fetch_add() does not exist.
>> 
>> That is gcc extension.  I did not find any API just to increment
>> the variable atomically without using/casting to atomic.  What
>> is a proper way to achieve that?
> 
> That's C11 atomics, and those shall not be used in the kernel. For one
> they're not available in the minimally required GCC version (4.6).
> 
> The proper and only way is to use atomic_t, but also you cannot share
> atomic_t with userspace.

Yes, that is what I tried to avoid by choosing the C11 extension.

> 
> The normal way of doing something like this is to have a kernel private
> atomic_t and copy the value out to userspace using smp_store_release().

Since this path is lockless, unfortunately that won't work.  So it
seems the only way is to do one more cmpxchg (sigh) or to give up and
take a lock (sad sigh).

--
Roman

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (13 preceding siblings ...)
  2019-05-31  9:55 ` [PATCH v3 00/13] epoll: support pollable epoll from userspace Peter Zijlstra
@ 2019-05-31 14:48 ` Jens Axboe
  2019-05-31 16:02   ` Roman Penyaev
  2019-05-31 16:33 ` Peter Zijlstra
  15 siblings, 1 reply; 50+ messages in thread
From: Jens Axboe @ 2019-05-31 14:48 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Azat Khuzhin, Andrew Morton, Al Viro, Linus Torvalds,
	linux-fsdevel, linux-kernel

On 5/16/19 2:57 AM, Roman Penyaev wrote:
> Hi all,
> 
> This is v3 which introduces pollable epoll from userspace.
> 
> v3:
>   - Measurements made, represented below.
> 
>   - Fix alignment for epoll_uitem structure on all 64-bit archs except
>     x86-64. epoll_uitem should be always 16 bit, proper BUILD_BUG_ON
>     is added. (Linus)
> 
>   - Check pollflags explicitly on 0 inside work callback, and do nothing
>     if 0.
> 
> v2:
>   - No reallocations, the max number of items (thus size of the user ring)
>     is specified by the caller.
> 
>   - Interface is simplified: -ENOSPC is returned on attempt to add a new
>     epoll item if number is reached the max, nothing more.
> 
>   - Alloced pages are accounted using user->locked_vm and limited to
>     RLIMIT_MEMLOCK value.
> 
>   - EPOLLONESHOT is handled.
> 
> This series introduces pollable epoll from userspace, i.e. user creates
> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets header
> and ring pointers and then consumes ready events from a ring, avoiding
> epoll_wait() call.  When ring is empty, user has to call epoll_wait()
> in order to wait for new events.  epoll_wait() returns -ESTALE if user
> ring has events in the ring (kind of indication, that user has to consume
> events from the user ring first, I could not invent anything better than
> returning -ESTALE).
> 
> For user header and user ring allocation I used vmalloc_user().  I found
> that it is much easy to reuse remap_vmalloc_range_partial() instead of
> dealing with page cache (like aio.c does).  What is also nice is that
> virtual address is properly aligned on SHMLBA, thus there should not be
> any d-cache aliasing problems on archs with vivt or vipt caches.

Why aren't we just adding support to io_uring for this instead? Then we
don't need yet another entirely new ring that's just a little different
from what we have.

I haven't looked into the details of your implementation, just curious
if there's anything that makes using io_uring a non-starter for this
purpose?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback()
  2019-05-31 13:05       ` Peter Zijlstra
@ 2019-05-31 15:05         ` Roman Penyaev
  0 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-31 15:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On 2019-05-31 15:05, Peter Zijlstra wrote:
> On Fri, May 31, 2019 at 01:22:54PM +0200, Roman Penyaev wrote:
>> On 2019-05-31 11:56, Peter Zijlstra wrote:
>> > On Thu, May 16, 2019 at 10:58:04AM +0200, Roman Penyaev wrote:
> 
>> > > +static inline bool ep_clear_public_event_bits(struct epitem *epi)
>> > > +{
>> > > +	__poll_t old, flags;
>> > > +
>> > > +	/*
>> > > +	 * Here we race with ourselves and with ep_modify(), which can
>> > > +	 * change the event bits.  In order not to override events updated
>> > > +	 * by ep_modify() we have to do cmpxchg.
>> > > +	 */
>> > > +
>> > > +	old = epi->event.events;
>> > > +	do {
>> > > +		flags = old;
>> > > +	} while ((old = cmpxchg(&epi->event.events, flags,
>> > > +				flags & EP_PRIVATE_BITS)) != flags);
>> > > +
>> > > +	return flags & ~EP_PRIVATE_BITS;
>> > > +}
>> >
>> > AFAICT epi->event.events also has normal writes to it, eg. in
>> > ep_modify(). A number of architectures cannot handle concurrent normal
>> > writes and cmpxchg() to the same variable.
>> 
>> Yes, we race with the current function and with ep_modify().  Then,
>> ep_modify()
>> should do something as the following:
>> 
>> -	epi->event.events = event->events
>> +	xchg(&epi->event.events, event->events);
>> 
>> Is that ok?
> 
> That should be correct, but at that point I think we should also always
> read the thing with READ_ONCE() to avoid load-tearing. And I suspect it
> then becomes sensible to change the type to atomic_t.

But it seems that if we are afraid of load tearing, that should be fixed
separately, independently of this patchset, because epi->event.events is
updated in ep_modify() and races with ep_poll_callback(), which reads the
value in a couple of places.

Probably nothing terrible will happen, because eventually the event
either arrives or is just ignored.
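
The direction being discussed would look roughly like this (just a
sketch, not the actual patch):

	/* ep_modify(): atomic RMW instead of a plain store, so it can't
	 * tear against the cmpxchg loop in ep_clear_public_event_bits() */
	xchg(&epi->event.events, event->events);

	/* readers, e.g. in ep_poll_callback(), avoid load tearing */
	events = READ_ONCE(epi->event.events);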


> atomic_set() vs atomic_cmpxchg() only carries the extra overhead on
> those 'dodgy' platforms.
> 
>> Just curious: what are these archs?
> 
> Oh, lovely stuff like parisc, sparc32 and arc-eznps. See
> arch/parisc/lib/bitops.c:__cmpxchg_*() for example :/ Those systems 
> only
> have a single truly atomic op (something from the xchg / test-and-set
> family) and the rest is fudged on top of that.

Locks, nice.

--
Roman



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace
  2019-05-31 14:48 ` Jens Axboe
@ 2019-05-31 16:02   ` Roman Penyaev
  2019-05-31 16:54     ` Jens Axboe
  0 siblings, 1 reply; 50+ messages in thread
From: Roman Penyaev @ 2019-05-31 16:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Azat Khuzhin, Andrew Morton, Al Viro, Linus Torvalds,
	linux-fsdevel, linux-kernel

On 2019-05-31 16:48, Jens Axboe wrote:
> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>> Hi all,
>> 
>> This is v3 which introduces pollable epoll from userspace.
>> 
>> v3:
>>   - Measurements made, represented below.
>> 
>>   - Fix alignment for epoll_uitem structure on all 64-bit archs except
>>     x86-64. epoll_uitem should be always 16 bit, proper BUILD_BUG_ON
>>     is added. (Linus)
>> 
>>   - Check pollflags explicitly on 0 inside work callback, and do 
>> nothing
>>     if 0.
>> 
>> v2:
>>   - No reallocations, the max number of items (thus size of the user 
>> ring)
>>     is specified by the caller.
>> 
>>   - Interface is simplified: -ENOSPC is returned on attempt to add a 
>> new
>>     epoll item if number is reached the max, nothing more.
>> 
>>   - Alloced pages are accounted using user->locked_vm and limited to
>>     RLIMIT_MEMLOCK value.
>> 
>>   - EPOLLONESHOT is handled.
>> 
>> This series introduces pollable epoll from userspace, i.e. user 
>> creates
>> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets 
>> header
>> and ring pointers and then consumes ready events from a ring, avoiding
>> epoll_wait() call.  When ring is empty, user has to call epoll_wait()
>> in order to wait for new events.  epoll_wait() returns -ESTALE if user
>> ring has events in the ring (kind of indication, that user has to 
>> consume
>> events from the user ring first, I could not invent anything better 
>> than
>> returning -ESTALE).
>> 
>> For user header and user ring allocation I used vmalloc_user().  I 
>> found
>> that it is much easy to reuse remap_vmalloc_range_partial() instead of
>> dealing with page cache (like aio.c does).  What is also nice is that
>> virtual address is properly aligned on SHMLBA, thus there should not 
>> be
>> any d-cache aliasing problems on archs with vivt or vipt caches.
> 
> Why aren't we just adding support to io_uring for this instead? Then we
> don't need yet another entirely new ring, that's is just a little
> different from what we have.
> 
> I haven't looked into the details of your implementation, just curious
> if there's anything that makes using io_uring a non-starter for this
> purpose?

Afaict the main difference is that you do not need to recharge an fd
(submit a new poll request in io_uring terms): once an fd has been added
to epoll with epoll_ctl(), we get events.  When you have thousands of
fds, that should matter.

Another interesting question is how difficult it is to modify existing
event loops in event libraries in order to support recharging
(EPOLLONESHOT in epoll terms), as sketched below.

Maybe Azat, who maintains libevent, can shed light on this (currently I
see that libevent does not support the "EPOLLONESHOT" logic).
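
To illustrate what recharging means for an event loop built on classic
epoll (a minimal sketch; handle_io() is a hypothetical handler and error
handling is omitted):

	struct epoll_event ev, events[64];
	int n, i;

	/* One-shot: the fd is disarmed after it fires once */
	ev.events = EPOLLIN | EPOLLONESHOT;
	ev.data.fd = fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

	for (;;) {
		n = epoll_wait(epfd, events, 64, -1);
		for (i = 0; i < n; i++) {
			handle_io(events[i].data.fd);

			/* Recharge: re-arm the fd, otherwise it stays silent */
			ev.events = EPOLLIN | EPOLLONESHOT;
			ev.data.fd = events[i].data.fd;
			epoll_ctl(epfd, EPOLL_CTL_MOD, events[i].data.fd, &ev);
		}
	}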


--
Roman



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace
  2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (14 preceding siblings ...)
  2019-05-31 14:48 ` Jens Axboe
@ 2019-05-31 16:33 ` Peter Zijlstra
  2019-05-31 18:50   ` Roman Penyaev
  15 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2019-05-31 16:33 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: azat, rpenyaev, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On Thu, May 16, 2019 at 10:57:57AM +0200, Roman Penyaev wrote:
> When new event comes for some epoll item kernel does the following:
> 
>  struct epoll_uitem *uitem;
> 
>  /* Each item has a bit (index in user items array), discussed later */
>  uitem = user_header->items[epi->bit];
> 
>  if (!atomic_fetch_or(uitem->ready_events, pollflags)) {
>      i = atomic_add(&ep->user_header->tail, 1);

So this is where you increment tail

> 
>      item_idx = &user_index[i & index_mask];
> 
>      /* Signal with a bit, user spins on index expecting value > 0 */
>      *item_idx = idx + 1;

IIUC, this is where you write the idx into shared memory, which is
_after_ tail has already been incremented.

>  }
> 
> Important thing here is that ring can't infinitely grow and corrupt other
> elements, because kernel always checks that item was marked as ready, so
> userspace has to clear ready_events field.
> 
> On userside events the following code should be used in order to consume
> events:
> 
>  tail = READ_ONCE(header->tail);
>  for (i = 0; header->head != tail; header->head++) {
>      item_idx_ptr = &index[idx & indeces_mask];
> 
>      /*
>       * Spin here till we see valid index
>       */
>      while (!(idx = __atomic_load_n(item_idx_ptr, __ATOMIC_ACQUIRE)))
>          ;

Which you then try and fix up by busy waiting for @idx to become !0 ?!

Why not write the idx first, then increment the ->tail, such that when
we see ->tail, we already know idx must be correct?

> 
>      item = &header->items[idx - 1];
> 
>      /*
>       * Mark index as invalid, that is for userspace only, kernel does not care
>       * and will refill this pointer only when observes that event is cleared,
>       * which happens below.
>       */
>      *item_idx_ptr = 0;

That avoids this store too.

> 
>      /*
>       * Fetch data first, if event is cleared by the kernel we drop the data
>       * returning false.
>       */
>      event->data = item->event.data;
>      event->events = __atomic_exchange_n(&item->ready_events, 0,
>                          __ATOMIC_RELEASE);
> 
>  }

Aside from that, you have to READ/WRITE_ONCE() on ->head, to avoid
load/store tearing.


That would give something like:

kernel:

	slot = atomic_fetch_inc(&ep->slot);
	item_idx = &user_index[slot & idx_mask];
	WRITE_ONCE(*item_idx, idx);
	smp_store_release(&ep->user_header->tail, slot);

userspace:

	tail = smp_load_acquire(&header->tail);
	for (head = READ_ONCE(header->head); head != tail; head++) {
		idx = READ_ONCE(index[head & idx_mask]);
		itemp = &header->items[idx];

		...
	}
	smp_store_release(&header->head, head);



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-31 14:21         ` Roman Penyaev
@ 2019-05-31 16:51           ` Peter Zijlstra
  2019-05-31 18:58             ` Roman Penyaev
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2019-05-31 16:51 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On Fri, May 31, 2019 at 04:21:30PM +0200, Roman Penyaev wrote:

> The ep_add_event_to_uring() is lockless, thus I can't increase tail after,
> I need to reserve the index slot, where to write to.  I can use shadow tail,
> which is not seen by userspace, but I have to guarantee that tail is updated
> with shadow tail *after* all callers of ep_add_event_to_uring() are left.
> That is possible, please see the code below, but it adds more complexity:
> 
> (code was tested on user side, thus has c11 atomics)
> 
> static inline void add_event__kernel(struct ring *ring, unsigned bit)
> {
>         unsigned i, cntr, commit_cntr, *item_idx, tail, old;
> 
>         i = __atomic_fetch_add(&ring->cntr, 1, __ATOMIC_ACQUIRE);
>         item_idx = &ring->user_itemsindex[i % ring->nr];
> 
>         /* Update data */
>         *item_idx = bit;
> 
>         commit_cntr = __atomic_add_fetch(&ring->commit_cntr, 1,
> __ATOMIC_RELEASE);
> 
>         tail = ring->user_header->tail;
>         rmb();
>         do {
>                 cntr = ring->cntr;
>                 if (cntr != commit_cntr)
>                         /* Someone else will advance tail */
>                         break;
> 
>                 old = tail;
> 
>         } while ((tail =
> __sync_val_compare_and_swap(&ring->user_header->tail, old, cntr)) != old);
> }

Yes, I'm well aware of that particular problem (see
kernel/events/ring_buffer.c:perf_output_put_handle for instance). But
like you show, it can be done. It also makes the thing wait-free, as
opposed to merely lockless.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-31 14:28         ` Roman Penyaev
@ 2019-05-31 16:53           ` Peter Zijlstra
  0 siblings, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2019-05-31 16:53 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On Fri, May 31, 2019 at 04:28:36PM +0200, Roman Penyaev wrote:
> On 2019-05-31 14:53, Peter Zijlstra wrote:
> > On Fri, May 31, 2019 at 01:15:21PM +0200, Roman Penyaev wrote:
> > > On 2019-05-31 11:56, Peter Zijlstra wrote:
> > > > On Thu, May 16, 2019 at 10:58:03AM +0200, Roman Penyaev wrote:
> > 
> > > > > +		i = __atomic_fetch_add(&ep->user_header->tail, 1,
> > > > > +				       __ATOMIC_ACQUIRE);
> > > >
> > > > afaict __atomic_fetch_add() does not exist.
> > > 
> > > That is gcc extension.  I did not find any API just to increment
> > > the variable atomically without using/casting to atomic.  What
> > > is a proper way to achieve that?
> > 
> > That's C11 atomics, and those shall not be used in the kernel. For one
> > they're not available in the minimally required GCC version (4.6).
> > 
> > The proper and only way is to use atomic_t, but also you cannot share
> > atomic_t with userspace.
> 
> Yes, that what I tried to avoid choosing c11 extension.
> 
> > 
> > The normal way of doing something like this is to have a kernel private
> > atomic_t and copy the value out to userspace using smp_store_release().
> 
> Since this path is lockless unfortunately that won't work.  So seems
> the only way is to do one more cmpxchg (sigh) or give up and take a
> look (sad sigh).

Of course it works; most of the extant buffers already have a shadow
tail and do the update to userspace with a store-release.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace
  2019-05-31 16:02   ` Roman Penyaev
@ 2019-05-31 16:54     ` Jens Axboe
  2019-05-31 19:45       ` Roman Penyaev
  0 siblings, 1 reply; 50+ messages in thread
From: Jens Axboe @ 2019-05-31 16:54 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Azat Khuzhin, Andrew Morton, Al Viro, Linus Torvalds,
	linux-fsdevel, linux-kernel

On 5/31/19 10:02 AM, Roman Penyaev wrote:
> On 2019-05-31 16:48, Jens Axboe wrote:
>> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>>> Hi all,
>>>
>>> This is v3 which introduces pollable epoll from userspace.
>>>
>>> v3:
>>>    - Measurements made, represented below.
>>>
>>>    - Fix alignment for epoll_uitem structure on all 64-bit archs except
>>>      x86-64. epoll_uitem should be always 16 bit, proper BUILD_BUG_ON
>>>      is added. (Linus)
>>>
>>>    - Check pollflags explicitly on 0 inside work callback, and do
>>> nothing
>>>      if 0.
>>>
>>> v2:
>>>    - No reallocations, the max number of items (thus size of the user
>>> ring)
>>>      is specified by the caller.
>>>
>>>    - Interface is simplified: -ENOSPC is returned on attempt to add a
>>> new
>>>      epoll item if number is reached the max, nothing more.
>>>
>>>    - Alloced pages are accounted using user->locked_vm and limited to
>>>      RLIMIT_MEMLOCK value.
>>>
>>>    - EPOLLONESHOT is handled.
>>>
>>> This series introduces pollable epoll from userspace, i.e. user
>>> creates
>>> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets
>>> header
>>> and ring pointers and then consumes ready events from a ring, avoiding
>>> epoll_wait() call.  When ring is empty, user has to call epoll_wait()
>>> in order to wait for new events.  epoll_wait() returns -ESTALE if user
>>> ring has events in the ring (kind of indication, that user has to
>>> consume
>>> events from the user ring first, I could not invent anything better
>>> than
>>> returning -ESTALE).
>>>
>>> For user header and user ring allocation I used vmalloc_user().  I
>>> found
>>> that it is much easy to reuse remap_vmalloc_range_partial() instead of
>>> dealing with page cache (like aio.c does).  What is also nice is that
>>> virtual address is properly aligned on SHMLBA, thus there should not
>>> be
>>> any d-cache aliasing problems on archs with vivt or vipt caches.
>>
>> Why aren't we just adding support to io_uring for this instead? Then we
>> don't need yet another entirely new ring, that's is just a little
>> different from what we have.
>>
>> I haven't looked into the details of your implementation, just curious
>> if there's anything that makes using io_uring a non-starter for this
>> purpose?
> 
> Afaict the main difference is that you do not need to recharge an fd
> (submit new poll request in terms of io_uring): once fd has been added
> to
> epoll with epoll_ctl() - we get events.  When you have thousands of fds
> -
> that should matter.
> 
> Also interesting question is how difficult to modify existing event
> loops
> in event libraries in order to support recharging (EPOLLONESHOT in terms
> of epoll).
> 
> Maybe Azat who maintains libevent can shed light on this (currently I
> see
> that libevent does not support "EPOLLONESHOT" logic).

In terms of existing io_uring poll support, which is what I'm guessing
you're referring to, it is indeed just one-shot. But there's no reason
why we can't have it persist until explicitly canceled with POLL_REMOVE.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace
  2019-05-31 16:33 ` Peter Zijlstra
@ 2019-05-31 18:50   ` Roman Penyaev
  0 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-05-31 18:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On 2019-05-31 18:33, Peter Zijlstra wrote:
> On Thu, May 16, 2019 at 10:57:57AM +0200, Roman Penyaev wrote:
>> When new event comes for some epoll item kernel does the following:
>> 
>>  struct epoll_uitem *uitem;
>> 
>>  /* Each item has a bit (index in user items array), discussed later 
>> */
>>  uitem = user_header->items[epi->bit];
>> 
>>  if (!atomic_fetch_or(uitem->ready_events, pollflags)) {
>>      i = atomic_add(&ep->user_header->tail, 1);
> 
> So this is where you increment tail
> 
>> 
>>      item_idx = &user_index[i & index_mask];
>> 
>>      /* Signal with a bit, user spins on index expecting value > 0 */
>>      *item_idx = idx + 1;
> 
> IUC, this is where you write the idx into shared memory, which is
> _after_ tail has already been incremented.
> 
>>  }
>> 
>> Important thing here is that ring can't infinitely grow and corrupt 
>> other
>> elements, because kernel always checks that item was marked as ready, 
>> so
>> userspace has to clear ready_events field.
>> 
>> On userside events the following code should be used in order to 
>> consume
>> events:
>> 
>>  tail = READ_ONCE(header->tail);
>>  for (i = 0; header->head != tail; header->head++) {
>>      item_idx_ptr = &index[idx & indeces_mask];
>> 
>>      /*
>>       * Spin here till we see valid index
>>       */
>>      while (!(idx = __atomic_load_n(item_idx_ptr, __ATOMIC_ACQUIRE)))
>>          ;
> 
> Which you then try and fix up by busy waiting for @idx to become !0 ?!
> 
> Why not write the idx first, then increment the ->tail, such that when
> we see ->tail, we already know idx must be correct?
> 
>> 
>>      item = &header->items[idx - 1];
>> 
>>      /*
>>       * Mark index as invalid, that is for userspace only, kernel does 
>> not care
>>       * and will refill this pointer only when observes that event is 
>> cleared,
>>       * which happens below.
>>       */
>>      *item_idx_ptr = 0;
> 
> That avoids this store too.
> 
>> 
>>      /*
>>       * Fetch data first, if event is cleared by the kernel we drop 
>> the data
>>       * returning false.
>>       */
>>      event->data = item->event.data;
>>      event->events = __atomic_exchange_n(&item->ready_events, 0,
>>                          __ATOMIC_RELEASE);
>> 
>>  }
> 
> Aside from that, you have to READ/WRITE_ONCE() on ->head, to avoid
> load/store tearing.

Yes, clear. Thanks.

> 
> 
> That would give something like:
> 
> kernel:
> 
> 	slot = atomic_fetch_inc(&ep->slot);
> 	item_idx = &user_index[slot & idx_mask];
> 	WRITE_ONCE(*item_idx, idx);
> 	smp_store_release(&ep->user_header->tail, slot);

This can't be called from many CPUs: tail can be overwritten with an "old"
value.  That is what I am trying to solve.
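
Sketch of the interleaving I mean (illustration only, two producers
racing):

	CPU0                                     CPU1
	slot = atomic_fetch_inc(&ep->slot); /* 0 */
	                                         slot = atomic_fetch_inc(&ep->slot); /* 1 */
	                                         WRITE_ONCE(*item_idx, idx);
	                                         smp_store_release(&user_header->tail, 1);
	WRITE_ONCE(*item_idx, idx);
	smp_store_release(&user_header->tail, 0);
	                                         /* tail went backwards: CPU1's entry is
	                                            no longer visible to userspace */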

--
Roman


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-31 16:51           ` Peter Zijlstra
@ 2019-05-31 18:58             ` Roman Penyaev
  2019-06-03  9:09               ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Roman Penyaev @ 2019-05-31 18:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On 2019-05-31 18:51, Peter Zijlstra wrote:
> On Fri, May 31, 2019 at 04:21:30PM +0200, Roman Penyaev wrote:
> 
>> The ep_add_event_to_uring() is lockless, thus I can't increase tail 
>> after,
>> I need to reserve the index slot, where to write to.  I can use shadow 
>> tail,
>> which is not seen by userspace, but I have to guarantee that tail is 
>> updated
>> with shadow tail *after* all callers of ep_add_event_to_uring() are 
>> left.
>> That is possible, please see the code below, but it adds more 
>> complexity:
>> 
>> (code was tested on user side, thus has c11 atomics)
>> 
>> static inline void add_event__kernel(struct ring *ring, unsigned bit)
>> {
>>         unsigned i, cntr, commit_cntr, *item_idx, tail, old;
>> 
>>         i = __atomic_fetch_add(&ring->cntr, 1, __ATOMIC_ACQUIRE);
>>         item_idx = &ring->user_itemsindex[i % ring->nr];
>> 
>>         /* Update data */
>>         *item_idx = bit;
>> 
>>         commit_cntr = __atomic_add_fetch(&ring->commit_cntr, 1,
>> __ATOMIC_RELEASE);
>> 
>>         tail = ring->user_header->tail;
>>         rmb();
>>         do {
>>                 cntr = ring->cntr;
>>                 if (cntr != commit_cntr)
>>                         /* Someone else will advance tail */
>>                         break;
>> 
>>                 old = tail;
>> 
>>         } while ((tail =
>> __sync_val_compare_and_swap(&ring->user_header->tail, old, cntr)) != 
>> old);
>> }
> 
> Yes, I'm well aware of that particular problem (see
> kernel/events/ring_buffer.c:perf_output_put_handle for instance).

I'll take a look, thanks.

> But like you show, it can be done. It also makes the thing wait-free, 
> as
> opposed to merely lockless.

You think it's better?  I did not like this variant from the very
beginning because of the unnecessary complexity, but maybe you're right.
No busy loops on the user side makes it wait-free.  And I can also avoid
C11 in the kernel by using cmpxchg along with atomic_t.
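
Roughly like this (a sketch of how the same commit could look with
kernel primitives; memory ordering details glossed over, names are
illustrative):

	static void add_event__kernel(struct ring *ring, unsigned int bit)
	{
		unsigned int i, cntr, commit, tail, old;

		/* reserve an index slot */
		i = atomic_fetch_inc(&ring->cntr);
		WRITE_ONCE(ring->user_itemsindex[i % ring->nr], bit);

		/* mark this slot as written */
		commit = atomic_inc_return(&ring->commit_cntr);

		/* the last committer advances the user-visible tail */
		tail = READ_ONCE(ring->user_header->tail);
		do {
			cntr = atomic_read(&ring->cntr);
			if (cntr != commit)
				/* someone else is still writing and will
				 * advance the tail */
				break;
			old = tail;
		} while ((tail = cmpxchg(&ring->user_header->tail,
					 old, cntr)) != old);
	}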

--
Roman



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace
  2019-05-31 16:54     ` Jens Axboe
@ 2019-05-31 19:45       ` Roman Penyaev
  2019-05-31 21:09         ` Jens Axboe
  0 siblings, 1 reply; 50+ messages in thread
From: Roman Penyaev @ 2019-05-31 19:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Azat Khuzhin, Andrew Morton, Al Viro, Linus Torvalds,
	linux-fsdevel, linux-kernel

On 2019-05-31 18:54, Jens Axboe wrote:
> On 5/31/19 10:02 AM, Roman Penyaev wrote:
>> On 2019-05-31 16:48, Jens Axboe wrote:
>>> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>>>> Hi all,
>>>> 
>>>> This is v3 which introduces pollable epoll from userspace.
>>>> 
>>>> v3:
>>>>    - Measurements made, represented below.
>>>> 
>>>>    - Fix alignment for epoll_uitem structure on all 64-bit archs 
>>>> except
>>>>      x86-64. epoll_uitem should be always 16 bit, proper 
>>>> BUILD_BUG_ON
>>>>      is added. (Linus)
>>>> 
>>>>    - Check pollflags explicitly on 0 inside work callback, and do
>>>> nothing
>>>>      if 0.
>>>> 
>>>> v2:
>>>>    - No reallocations, the max number of items (thus size of the 
>>>> user
>>>> ring)
>>>>      is specified by the caller.
>>>> 
>>>>    - Interface is simplified: -ENOSPC is returned on attempt to add 
>>>> a
>>>> new
>>>>      epoll item if number is reached the max, nothing more.
>>>> 
>>>>    - Alloced pages are accounted using user->locked_vm and limited 
>>>> to
>>>>      RLIMIT_MEMLOCK value.
>>>> 
>>>>    - EPOLLONESHOT is handled.
>>>> 
>>>> This series introduces pollable epoll from userspace, i.e. user
>>>> creates
>>>> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets
>>>> header
>>>> and ring pointers and then consumes ready events from a ring, 
>>>> avoiding
>>>> epoll_wait() call.  When ring is empty, user has to call 
>>>> epoll_wait()
>>>> in order to wait for new events.  epoll_wait() returns -ESTALE if 
>>>> user
>>>> ring has events in the ring (kind of indication, that user has to
>>>> consume
>>>> events from the user ring first, I could not invent anything better
>>>> than
>>>> returning -ESTALE).
>>>> 
>>>> For user header and user ring allocation I used vmalloc_user().  I
>>>> found
>>>> that it is much easy to reuse remap_vmalloc_range_partial() instead 
>>>> of
>>>> dealing with page cache (like aio.c does).  What is also nice is 
>>>> that
>>>> virtual address is properly aligned on SHMLBA, thus there should not
>>>> be
>>>> any d-cache aliasing problems on archs with vivt or vipt caches.
>>> 
>>> Why aren't we just adding support to io_uring for this instead? Then 
>>> we
>>> don't need yet another entirely new ring, that's is just a little
>>> different from what we have.
>>> 
>>> I haven't looked into the details of your implementation, just 
>>> curious
>>> if there's anything that makes using io_uring a non-starter for this
>>> purpose?
>> 
>> Afaict the main difference is that you do not need to recharge an fd
>> (submit new poll request in terms of io_uring): once fd has been added
>> to
>> epoll with epoll_ctl() - we get events.  When you have thousands of 
>> fds
>> -
>> that should matter.
>> 
>> Also interesting question is how difficult to modify existing event
>> loops
>> in event libraries in order to support recharging (EPOLLONESHOT in 
>> terms
>> of epoll).
>> 
>> Maybe Azat who maintains libevent can shed light on this (currently I
>> see
>> that libevent does not support "EPOLLONESHOT" logic).
> 
> In terms of existing io_uring poll support, which is what I'm guessing
> you're referring to, it is indeed just one-shot.

Yes, yes.

> But there's no reason  why we can't have it persist until explicitly
> canceled with POLL_REMOVE.

It seems not so easy.  The main problem is that with only a ring it is
impossible to figure out on the kernel side which event bits have already
been seen by userspace and which bits are new.  So every new cqe would
have to be added to the completion ring on each wake_up_interruptible()
call (I mean whenever an fd wants to report that something is ready).

IMO that can lead to many duplicate events (tens? hundreds? honestly no
idea), which userspace has to handle with subsequent read/write calls.
It can kill all the performance benefits of a uring.

In uepoll this is solved with another piece of shared memory, where
userspace atomically clears bits and the kernel side sets bits.  If the
kernel observes that the bits were already clear (i.e. userspace has
consumed the previous event and would otherwise not see this one), a new
index is added to the ring; if the bits are still set, the entry already
pending in the ring carries the new bits and no duplicate is added.
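
In code the handshake is roughly the following (same logic as the
snippets from the cover letter; userspace-style atomics used for
brevity, add_index_to_ring() is just an illustrative name):

	/* kernel side, when an fd becomes ready */
	if (!__atomic_fetch_or(&uitem->ready_events, pollflags, __ATOMIC_ACQUIRE))
		/* bits were clear: userspace has consumed the previous
		 * event, so a new index has to be put into the ring */
		add_index_to_ring(ep, epi->bit);

	/* userspace side, while consuming an event */
	event->events = __atomic_exchange_n(&item->ready_events, 0,
					    __ATOMIC_RELEASE);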

Can we extend the io_uring API to support this behavior?  It would also
be great if we could make the event path lockless.  With a big number of
fds and frequent events this matters; please take a look, recently I did
some measurements:  https://lkml.org/lkml/2018/12/12/305

--
Roman


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace
  2019-05-31 19:45       ` Roman Penyaev
@ 2019-05-31 21:09         ` Jens Axboe
  2019-06-05  6:17           ` Roman Penyaev
  0 siblings, 1 reply; 50+ messages in thread
From: Jens Axboe @ 2019-05-31 21:09 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Azat Khuzhin, Andrew Morton, Al Viro, Linus Torvalds,
	linux-fsdevel, linux-kernel

On 5/31/19 1:45 PM, Roman Penyaev wrote:
> On 2019-05-31 18:54, Jens Axboe wrote:
>> On 5/31/19 10:02 AM, Roman Penyaev wrote:
>>> On 2019-05-31 16:48, Jens Axboe wrote:
>>>> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>>>>> Hi all,
>>>>>
>>>>> This is v3 which introduces pollable epoll from userspace.
>>>>>
>>>>> v3:
>>>>>     - Measurements made, represented below.
>>>>>
>>>>>     - Fix alignment for epoll_uitem structure on all 64-bit archs
>>>>> except
>>>>>       x86-64. epoll_uitem should be always 16 bit, proper
>>>>> BUILD_BUG_ON
>>>>>       is added. (Linus)
>>>>>
>>>>>     - Check pollflags explicitly on 0 inside work callback, and do
>>>>> nothing
>>>>>       if 0.
>>>>>
>>>>> v2:
>>>>>     - No reallocations, the max number of items (thus size of the
>>>>> user
>>>>> ring)
>>>>>       is specified by the caller.
>>>>>
>>>>>     - Interface is simplified: -ENOSPC is returned on attempt to add
>>>>> a
>>>>> new
>>>>>       epoll item if number is reached the max, nothing more.
>>>>>
>>>>>     - Alloced pages are accounted using user->locked_vm and limited
>>>>> to
>>>>>       RLIMIT_MEMLOCK value.
>>>>>
>>>>>     - EPOLLONESHOT is handled.
>>>>>
>>>>> This series introduces pollable epoll from userspace, i.e. user
>>>>> creates
>>>>> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets
>>>>> header
>>>>> and ring pointers and then consumes ready events from a ring,
>>>>> avoiding
>>>>> epoll_wait() call.  When ring is empty, user has to call
>>>>> epoll_wait()
>>>>> in order to wait for new events.  epoll_wait() returns -ESTALE if
>>>>> user
>>>>> ring has events in the ring (kind of indication, that user has to
>>>>> consume
>>>>> events from the user ring first, I could not invent anything better
>>>>> than
>>>>> returning -ESTALE).
>>>>>
>>>>> For user header and user ring allocation I used vmalloc_user().  I
>>>>> found
>>>>> that it is much easy to reuse remap_vmalloc_range_partial() instead
>>>>> of
>>>>> dealing with page cache (like aio.c does).  What is also nice is
>>>>> that
>>>>> virtual address is properly aligned on SHMLBA, thus there should not
>>>>> be
>>>>> any d-cache aliasing problems on archs with vivt or vipt caches.
>>>>
>>>> Why aren't we just adding support to io_uring for this instead? Then
>>>> we
>>>> don't need yet another entirely new ring, that's is just a little
>>>> different from what we have.
>>>>
>>>> I haven't looked into the details of your implementation, just
>>>> curious
>>>> if there's anything that makes using io_uring a non-starter for this
>>>> purpose?
>>>
>>> Afaict the main difference is that you do not need to recharge an fd
>>> (submit new poll request in terms of io_uring): once fd has been added
>>> to
>>> epoll with epoll_ctl() - we get events.  When you have thousands of
>>> fds
>>> -
>>> that should matter.
>>>
>>> Also interesting question is how difficult to modify existing event
>>> loops
>>> in event libraries in order to support recharging (EPOLLONESHOT in
>>> terms
>>> of epoll).
>>>
>>> Maybe Azat who maintains libevent can shed light on this (currently I
>>> see
>>> that libevent does not support "EPOLLONESHOT" logic).
>>
>> In terms of existing io_uring poll support, which is what I'm guessing
>> you're referring to, it is indeed just one-shot.
> 
> Yes, yes.
> 
>> But there's no reason  why we can't have it persist until explicitly
>> canceled with POLL_REMOVE.
> 
> It seems not so easy.  The main problem is that with only a ring it is
> impossible to figure out on kernel side what event bits have been
> already
> seen by the userspace and what bits are new.  So every new cqe has to
> be added to a completion ring on each wake_up_interruptible() call.
> (I mean when fd wants to report that something is ready).
> 
> IMO that can lead to many duplicate events (tens? hundreds? honestly no
> idea), which userspace has to handle with subsequent read/write calls.
> It can kill all performance benefits of a uring.
> 
> In uepoll this is solved with another piece of shared memory, where
> userspace atomically clears bits and kernel side sets bits.  If kernel
> observes that bits were set (i.e. userspace has not seen this event)
> - new index is added to a ring.

Those are good points.

> Can we extend the io_uring API to support this behavior?  Also would
> be great if we can make event path lockless.  On a big number of fds
> and frequent events - this matters, please take a look, recently I
> did some measurements:  https://lkml.org/lkml/2018/12/12/305

Yeah, I'd be happy to entertain that idea, and lockless completions as
well. We already do that for polled IO, but consider any "normal"
completion to be IRQ driven and hence need locking.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-05-31 18:58             ` Roman Penyaev
@ 2019-06-03  9:09               ` Peter Zijlstra
  2019-06-03 10:02                 ` Roman Penyaev
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2019-06-03  9:09 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On Fri, May 31, 2019 at 08:58:19PM +0200, Roman Penyaev wrote:
> On 2019-05-31 18:51, Peter Zijlstra wrote:

> > But like you show, it can be done. It also makes the thing wait-free, as
> > opposed to merely lockless.
> 
> You think it's better?  I did not like this variant from the very
> beginning because of the unnecessary complexity.  But maybe you're
> right.  No busy loops on user side makes it wait-free.  And also
> I can avoid c11 in kernel using cmpxchg along with atomic_t.

Imagine the (v)CPU going for an extended nap right between publishing the
new tail and writing the !0 entry. Then your userspace is stuck burning
cycles without getting anything useful done.
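
Something like this (an illustrative interleaving, not from the patch):

	kernel (vCPU)                           userspace
	i = atomic_add(&user_header->tail, 1);
	  <vCPU scheduled out>
	                                        tail = READ_ONCE(header->tail);
	                                        /* new tail already visible */
	                                        while (!(idx = __atomic_load_n(item_idx_ptr,
	                                                                       __ATOMIC_ACQUIRE)))
	                                                ;  /* burns cycles until
	                                                      the vCPU runs again */
	*item_idx = idx + 1;
	                                        /* entry finally observed, loop exits */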



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring
  2019-06-03  9:09               ` Peter Zijlstra
@ 2019-06-03 10:02                 ` Roman Penyaev
  0 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-06-03 10:02 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: azat, akpm, viro, torvalds, linux-fsdevel, linux-kernel

On 2019-06-03 11:09, Peter Zijlstra wrote:
> On Fri, May 31, 2019 at 08:58:19PM +0200, Roman Penyaev wrote:
>> On 2019-05-31 18:51, Peter Zijlstra wrote:
> 
>> > But like you show, it can be done. It also makes the thing wait-free, as
>> > opposed to merely lockless.
>> 
>> You think it's better?  I did not like this variant from the very
>> beginning because of the unnecessary complexity.  But maybe you're
>> right.  No busy loops on user side makes it wait-free.  And also
>> I can avoid c11 in kernel using cmpxchg along with atomic_t.
> 
> Imagine the (v)CPU going for an extended nap right between publishing 
> the
> new tail and writing the !0 entry. Then your userspace is stuck burning
> cycles without getting anything useful done.

Yes, that is absolutely not nice.  That also worries me.  I wanted to
minimize the number of atomic ops on the hot path, and of course in that
respect the busy-loop is more attractive.

I will polish and switch back to the wait-free variant and resend the
whole patchset.  Could you please take a look?  I will add you to CC.

--
Roman

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace
  2019-05-31 21:09         ` Jens Axboe
@ 2019-06-05  6:17           ` Roman Penyaev
  0 siblings, 0 replies; 50+ messages in thread
From: Roman Penyaev @ 2019-06-05  6:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Azat Khuzhin, Andrew Morton, Al Viro, Linus Torvalds,
	linux-fsdevel, linux-kernel

On 2019-05-31 23:09, Jens Axboe wrote:
> On 5/31/19 1:45 PM, Roman Penyaev wrote:
>> On 2019-05-31 18:54, Jens Axboe wrote:
>>> On 5/31/19 10:02 AM, Roman Penyaev wrote:
>>>> On 2019-05-31 16:48, Jens Axboe wrote:
>>>>> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>>>>>> Hi all,
>>>>>> 
>>>>>> This is v3 which introduces pollable epoll from userspace.
>>>>>> 
>>>>>> v3:
>>>>>>     - Measurements made, represented below.
>>>>>> 
>>>>>>     - Fix alignment for epoll_uitem structure on all 64-bit archs
>>>>>> except
>>>>>>       x86-64. epoll_uitem should be always 16 bit, proper
>>>>>> BUILD_BUG_ON
>>>>>>       is added. (Linus)
>>>>>> 
>>>>>>     - Check pollflags explicitly on 0 inside work callback, and do
>>>>>> nothing
>>>>>>       if 0.
>>>>>> 
>>>>>> v2:
>>>>>>     - No reallocations, the max number of items (thus size of the
>>>>>> user
>>>>>> ring)
>>>>>>       is specified by the caller.
>>>>>> 
>>>>>>     - Interface is simplified: -ENOSPC is returned on attempt to 
>>>>>> add
>>>>>> a
>>>>>> new
>>>>>>       epoll item if number is reached the max, nothing more.
>>>>>> 
>>>>>>     - Alloced pages are accounted using user->locked_vm and 
>>>>>> limited
>>>>>> to
>>>>>>       RLIMIT_MEMLOCK value.
>>>>>> 
>>>>>>     - EPOLLONESHOT is handled.
>>>>>> 
>>>>>> This series introduces pollable epoll from userspace, i.e. user
>>>>>> creates
>>>>>> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets
>>>>>> header
>>>>>> and ring pointers and then consumes ready events from a ring,
>>>>>> avoiding
>>>>>> epoll_wait() call.  When ring is empty, user has to call
>>>>>> epoll_wait()
>>>>>> in order to wait for new events.  epoll_wait() returns -ESTALE if
>>>>>> user
>>>>>> ring has events in the ring (kind of indication, that user has to
>>>>>> consume
>>>>>> events from the user ring first, I could not invent anything 
>>>>>> better
>>>>>> than
>>>>>> returning -ESTALE).
>>>>>> 
>>>>>> For user header and user ring allocation I used vmalloc_user().  I
>>>>>> found
>>>>>> that it is much easy to reuse remap_vmalloc_range_partial() 
>>>>>> instead
>>>>>> of
>>>>>> dealing with page cache (like aio.c does).  What is also nice is
>>>>>> that
>>>>>> virtual address is properly aligned on SHMLBA, thus there should 
>>>>>> not
>>>>>> be
>>>>>> any d-cache aliasing problems on archs with vivt or vipt caches.
>>>>> 
>>>>> Why aren't we just adding support to io_uring for this instead? 
>>>>> Then
>>>>> we
>>>>> don't need yet another entirely new ring, that's is just a little
>>>>> different from what we have.
>>>>> 
>>>>> I haven't looked into the details of your implementation, just
>>>>> curious
>>>>> if there's anything that makes using io_uring a non-starter for 
>>>>> this
>>>>> purpose?
>>>> 
>>>> Afaict the main difference is that you do not need to recharge an fd
>>>> (submit new poll request in terms of io_uring): once fd has been 
>>>> added
>>>> to
>>>> epoll with epoll_ctl() - we get events.  When you have thousands of
>>>> fds
>>>> -
>>>> that should matter.
>>>> 
>>>> Also interesting question is how difficult to modify existing event
>>>> loops
>>>> in event libraries in order to support recharging (EPOLLONESHOT in
>>>> terms
>>>> of epoll).
>>>> 
>>>> Maybe Azat who maintains libevent can shed light on this (currently 
>>>> I
>>>> see
>>>> that libevent does not support "EPOLLONESHOT" logic).
>>> 
>>> In terms of existing io_uring poll support, which is what I'm 
>>> guessing
>>> you're referring to, it is indeed just one-shot.
>> 
>> Yes, yes.
>> 
>>> But there's no reason  why we can't have it persist until explicitly
>>> canceled with POLL_REMOVE.
>> 
>> It seems not so easy.  The main problem is that with only a ring it is
>> impossible to figure out on kernel side what event bits have been
>> already
>> seen by the userspace and what bits are new.  So every new cqe has to
>> be added to a completion ring on each wake_up_interruptible() call.
>> (I mean when fd wants to report that something is ready).
>> 
>> IMO that can lead to many duplicate events (tens? hundreds? honestly 
>> no
>> idea), which userspace has to handle with subsequent read/write calls.
>> It can kill all performance benefits of a uring.
>> 
>> In uepoll this is solved with another piece of shared memory, where
>> userspace atomically clears bits and kernel side sets bits.  If kernel
>> observes that bits were set (i.e. userspace has not seen this event)
>> - new index is added to a ring.
> 
> Those are good points.
> 
>> Can we extend the io_uring API to support this behavior?  Also would
>> be great if we can make event path lockless.  On a big number of fds
>> and frequent events - this matters, please take a look, recently I
>> did some measurements:  https://lkml.org/lkml/2018/12/12/305
> 
> Yeah, I'd be happy to entertain that idea, and lockless completions as
> well. We already do that for polled IO, but consider any "normal"
> completion to be IRQ driven and hence need locking.

I would like to contribute as much as I can.  "Subscription" to events
along with a lockless ring seems reasonable to do for io_uring.  I still
tend to think that uepoll and io_uring poll can coexist, at least
because it can be difficult to adapt current event libraries to the
async nature of the "add fd" / "remove fd" requests of io_uring, e.g.
when epoll_ctl() is called in order to remove an fd, the caller expects
that no events arrive after epoll_ctl() returns.  Async behavior can
break the event loop.  What could help is the ability to wait on a
particular request, which seems not possible without ugly tricks, right?
(By ugly tricks I mean something like: wait for any event, traverse the
completion ring in order to find the particular completion, repeat if
nothing is found.)

Also epoll_ctl() can be called from another thread in order to
add/remove an fd, and I suppose that is also successfully used by event
loop libraries or users of these libraries (not quite sure though, but I
can imagine why it can be useful).  Fixing that would require
introducing locks on the submission path of io_uring callers (I mean on
the user side, inside these libraries), which can impact performance in
the generic case (only submission though).

What I want to say is that polling using io_uring can be used in some
new io/event stacks, but adapting current event libraries can be
non-trivial, whereas old plain epoll with a ring can be the easiest way.
But of course that's only my speculation.

--
Roman


^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2019-06-05  6:17 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-16  8:57 [PATCH v3 00/13] epoll: support pollable epoll from userspace Roman Penyaev
2019-05-16  8:57 ` [PATCH v3 01/13] epoll: move private helpers from a header to the source Roman Penyaev
2019-05-16  8:57 ` [PATCH v3 02/13] epoll: introduce user structures for polling from userspace Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 03/13] epoll: allocate user header and user events ring " Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 04/13] epoll: some sanity flags checks for epoll syscalls " Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 05/13] epoll: offload polling to a work in case of epfd polled " Roman Penyaev
2019-05-21  7:51   ` Eric Wong
2019-05-22 12:54     ` Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring Roman Penyaev
2019-05-31  9:55   ` Peter Zijlstra
2019-05-31 11:24     ` Roman Penyaev
2019-05-31 13:11       ` Peter Zijlstra
2019-05-31  9:56   ` Peter Zijlstra
2019-05-31 11:15     ` Roman Penyaev
2019-05-31 12:53       ` Peter Zijlstra
2019-05-31 14:28         ` Roman Penyaev
2019-05-31 16:53           ` Peter Zijlstra
2019-05-31 12:56       ` Peter Zijlstra
2019-05-31 14:21         ` Roman Penyaev
2019-05-31 16:51           ` Peter Zijlstra
2019-05-31 18:58             ` Roman Penyaev
2019-06-03  9:09               ` Peter Zijlstra
2019-06-03 10:02                 ` Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback() Roman Penyaev
2019-05-31  9:56   ` Peter Zijlstra
2019-05-31 11:22     ` Roman Penyaev
2019-05-31 13:05       ` Peter Zijlstra
2019-05-31 15:05         ` Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 08/13] epoll: support polling from userspace for ep_insert() Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 09/13] epoll: support polling from userspace for ep_remove() Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 10/13] epoll: support polling from userspace for ep_modify() Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 11/13] epoll: support polling from userspace for ep_poll() Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 12/13] epoll: support mapping for epfd when polled from userspace Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 13/13] epoll: implement epoll_create2() syscall Roman Penyaev
2019-05-16 10:03   ` Arnd Bergmann
2019-05-16 10:20     ` Roman Penyaev
2019-05-16 10:57       ` Arnd Bergmann
2019-05-22  2:33       ` Andrew Morton
2019-05-22  9:11         ` Roman Penyaev
2019-05-22 11:14         ` Arnd Bergmann
2019-05-22 18:36           ` Andrew Morton
2019-05-31  9:55 ` [PATCH v3 00/13] epoll: support pollable epoll from userspace Peter Zijlstra
2019-05-31 14:48 ` Jens Axboe
2019-05-31 16:02   ` Roman Penyaev
2019-05-31 16:54     ` Jens Axboe
2019-05-31 19:45       ` Roman Penyaev
2019-05-31 21:09         ` Jens Axboe
2019-06-05  6:17           ` Roman Penyaev
2019-05-31 16:33 ` Peter Zijlstra
2019-05-31 18:50   ` Roman Penyaev
