* [RFC 00/15] epoll: support pollable epoll from userspace
@ 2019-01-09 16:40 Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 01/15] mm/vmalloc: add new 'alignment' field for vm_struct structure Roman Penyaev
                   ` (14 more replies)
  0 siblings, 15 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Luis R. Rodriguez, Paul E. McKenney, Al Viro,
	Andrea Parri, Andrew Morton, Andrey Ryabinin, Davidlohr Bueso,
	Jason Baron, Joe Perches, Linus Torvalds, Michal Hocko,
	linux-fsdevel, linux-mm, linux-kernel

Hi all,

This series introduces pollable epoll from userspace: the user creates an
epfd with the new EPOLL_USERPOLL flag, mmaps the epoll descriptor, gets the
header and ring pointers and then consumes ready events from the ring,
avoiding the epoll_wait() call.  When the ring is empty, the user has to
call epoll_wait() in order to wait for new events.  epoll_wait() returns
-ESTALE if the user ring still has events in it (an indication that the
user has to consume events from the user ring first; I could not invent
anything better than returning -ESTALE).

For user header and user ring allocation I used vmalloc_user().  I found
that it is much easier to reuse remap_vmalloc_range_partial() than to deal
with the page cache (like aio.c does).  What is also nice is that the
virtual address is properly aligned on SHMLBA, thus there should not be
any d-cache aliasing problems on archs with VIVT or VIPT caches.

I also needed vrealloc(), which can hide all this "alloc new area - get
pages - map pages" stuff.  So vrealloc() is introduced in the first 3 patches.

** Limitations
    
1. EPOLLET flag is always expected for new epoll items (Edge Triggered behavior)
     obviously we can't call vfs_poll() from userspace to have level
     triggered behaviour.
    
2. No support for EPOLLWAKEUP
     events are consumed from userspace, thus no way to call __pm_relax()
    
3. No support for EPOLLEXCLUSIVE
     If device does not pass pollflags to wake_up() there is no way to
     call poll() from the context under spinlock, thus special work is
     scheduled to offload polling.  In this specific case we can't
     support exclusive wakeups, because we do not know actual result
     of scheduled work and have to wake up every waiter.
    
4. No support for nesting of epoll descriptors polled from userspace
     no real good reason to scan ready events of user ring from the
     kernel, so just do not do that.


** Principle of operation

* Basic structures shared with userspace:

In order to consume events from userspace all inserted items should be
stored in items array, which has original epoll_event field and u32
field for keeping ready events, i.e. each item has the following struct:

 struct user_epitem {
    unsigned int ready_events;
    struct epoll_event event;
 };
 BUILD_BUG_ON(sizeof(struct user_epitem) != 16);

And the following is a header, which is seen by userspace:

 struct user_header {
    unsigned int magic;          /* epoll user header magic */
    unsigned int state;          /* epoll ring state */
    unsigned int header_length;  /* length of the header + items */
    unsigned int index_length;   /* length of the index ring */
    unsigned int max_items_nr;   /* max num of items slots */
    unsigned int max_index_nr;   /* max num of items indeces, always pow2 */
    unsigned int head;           /* updated by userland */
    unsigned int tail;           /* updated by kernel */
    unsigned int padding[24];    /* Header size is 128 bytes */

    struct user_epitem items[];
 };

 /* Header is 128 bytes, thus items are aligned on CPU cache */
 BUILD_BUG_ON(sizeof(struct user_header) != 128);

Initially the kernel allocates 1 page for the user header, i.e. with a
4096-byte page there is room for (4096 - 128) / 16 = 248 items.

When the 249th item is inserted the header has to be expanded, which is
discussed later.

Ready events are kept in a ring buffer, which is simply an index table,
where each element points to an item in a header:

 unsigned int *user_index;

The kernel also allocates 1 page for the user index, i.e. a 4096-byte page
gives a ring capacity of 4096 / 4 = 1024 elements.
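
The capacities quoted above follow directly from the structure sizes.  A
minimal userspace sketch of that arithmetic is shown here; the sketch_*
names are illustrative and not part of the series, the event structure is
declared packed as on x86-64, and the packed attribute assumes GCC or Clang:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of struct epoll_event, packed as on x86-64. */
struct sketch_event {
    uint32_t events;
    uint64_t data;
} __attribute__((packed));

/* Mirror of struct user_epitem from the cover letter. */
struct sketch_epitem {
    unsigned int ready_events;
    struct sketch_event event;
};

/* Mirror of struct user_header from the cover letter. */
struct sketch_header {
    unsigned int magic;
    unsigned int state;
    unsigned int header_length;
    unsigned int index_length;
    unsigned int max_items_nr;
    unsigned int max_index_nr;
    unsigned int head;
    unsigned int tail;
    unsigned int padding[24];

    struct sketch_epitem items[];
};

_Static_assert(sizeof(struct sketch_epitem) == 16, "item is 16 bytes");
_Static_assert(sizeof(struct sketch_header) == 128, "header is 128 bytes");

/* Number of item slots that fit into a header mapping of this length. */
static inline size_t sketch_max_items(size_t header_length)
{
    return (header_length - sizeof(struct sketch_header)) /
           sizeof(struct sketch_epitem);
}

/* Number of ring elements that fit into an index mapping of this length. */
static inline size_t sketch_max_index(size_t index_length)
{
    return index_length / sizeof(unsigned int);
}
```

With one 4096-byte page each, sketch_max_items() gives 248 and
sketch_max_index() gives 1024, matching the numbers in the text.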


* How is a new event accounted on the kernel side?  How is it consumed from
* userspace?

When a new event comes for some epoll item the kernel does the following:

 struct user_epitem *uitem;

 /* Each item has a bit (index in user items array), discussed later */
 uitem = &user_header->items[epi->bit];

 if (!atomic_fetch_or(&uitem->ready_events, pollflags)) {
     /* Pseudocode: fetch-and-add, i.e. i is the tail before the increment */
     i = atomic_add(&ep->user_header->tail, 1);

     item_idx = &user_index[i & index_mask];

     /* Signal with the item bit + 1, user spins on index expecting value > 0 */
     *item_idx = epi->bit + 1;

     /*
      * Want index update be flushed from CPU write buffer and
      * immediately visible on userspace side to avoid long busy
      * loops.
      */
     smp_wmb();
 }

The important thing here is that the ring can't grow infinitely and corrupt
other elements: the kernel adds an index only if the item was not already
marked as ready, so each item occupies at most one ring slot until userspace
clears its ready_events field.

On the user side the following code should be used in order to consume
events:

 tail = READ_ONCE(header->tail);
 for (i = 0; header->head != tail; header->head++) {
     /* Ring slot which corresponds to the current head position */
     item_idx_ptr = &index[header->head & indeces_mask];

     /*
      * Spin here till we see valid index
      */
     while (!(idx = __atomic_load_n(item_idx_ptr, __ATOMIC_ACQUIRE)))
         ;

     item = &header->items[idx - 1];

     /*
      * Mark index as invalid, that is for userspace only, kernel does not care
      * and will refill this pointer only when observes that event is cleared,
      * which happens below.
      */
     *item_idx_ptr = 0;

     /*
      * Fetch data first, if event is cleared by the kernel we drop the data
      * returning false.
      */
     event->data = item->event.data;
     event->events = __atomic_exchange_n(&item->ready_events, 0,
                         __ATOMIC_RELEASE);

 }
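
The producer and consumer halves described in this section can be exercised
together in plain userspace.  What follows is only a sketch under stated
assumptions: it uses C11 atomics instead of kernel primitives, a fixed-size
ring, a single consumer, and hypothetical sk_* names which do not appear in
the series:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_ITEMS 8u
#define SK_INDEX 16u               /* ring capacity, must be a power of two */

struct sk_item {
    atomic_uint ready_events;      /* cleared by the consumer */
    uint64_t data;                 /* user data registered for the item */
};

struct sk_ring {
    atomic_uint head;              /* advanced by the consumer */
    atomic_uint tail;              /* advanced by the producer */
    atomic_uint index[SK_INDEX];   /* 0 means "slot not filled yet" */
    struct sk_item items[SK_ITEMS];
};

/* Producer ("kernel" role): returns true if a new ring slot was used. */
static bool sk_produce(struct sk_ring *r, unsigned int bit, unsigned int flags)
{
    unsigned int i;

    /* Item already marked ready: only merge the flags, take no new slot. */
    if (atomic_fetch_or(&r->items[bit].ready_events, flags))
        return false;

    i = atomic_fetch_add(&r->tail, 1);
    /* Store bit + 1 so that 0 stays reserved for "empty". */
    atomic_store_explicit(&r->index[i & (SK_INDEX - 1)], bit + 1,
                          memory_order_release);
    return true;
}

/* Consumer ("userspace" role): returns false if the ring is empty. */
static bool sk_consume(struct sk_ring *r, unsigned int *events, uint64_t *data)
{
    unsigned int head = atomic_load(&r->head);
    unsigned int idx;
    atomic_uint *slot;

    if (head == atomic_load(&r->tail))
        return false;

    slot = &r->index[head & (SK_INDEX - 1)];
    /* Spin until the producer has published the slot. */
    while (!(idx = atomic_load_explicit(slot, memory_order_acquire)))
        ;
    atomic_store(slot, 0);

    /* Fetch data first, then clear ready_events to re-arm the item. */
    *data = r->items[idx - 1].data;
    *events = atomic_exchange(&r->items[idx - 1].ready_events, 0);
    atomic_store(&r->head, head + 1);
    return true;
}
```

Two sk_produce() calls for the same item before it is consumed take a single
ring slot and the flags are merged, which is the property that keeps the
ring bounded by the number of items.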


* How new epoll item gets its index inside user items array?

The kernel has a bitmap for that and takes a free bit on each attempt to
insert a new epoll item.  When the bitmap is full it is expanded.

* What happens when user items or user index has to be expanded or shrunk?

For those quite rare cases the kernel has to ask userspace to invoke
epoll_wait() in order to reallocate all user pointers under locks, i.e. for
that particular period all events are routed to kernel lists instead of the
user ring and the kernel sets a special INACTIVE state in the user header in
order to notify the user that new events won't appear in the ring until the
user calls epoll_wait().  It is worth mentioning that expand is done directly
inside ep_insert(), because expand is an allocation of a new page and
recreation of the virtual area on the kernel side, which does not affect
mappings on the user side.

* How userspace detects that kernel has expanded or shrunk the memory?

Any of the item ctl operations (add, mod, del) can be executed in parallel
with events consumption from user ring.

Expand is safe from the user perspective (new pages are mapped on the kernel
side, but the user does not know or care about that), so expand happens
directly in epoll_ctl(EPOLL_CTL_ADD), but the kernel routes all new events to
kernel lists and asks the user to call epoll_wait() with the special INACTIVE
state.

Shrink is a bit different.  When epoll_ctl(EPOLL_CTL_DEL) is called and the
kernel decides to shrink the memory, it routes new events to kernel lists,
marks the user header state as INACTIVE and does not put the item bit
immediately, but postpones it until the user calls epoll_wait() (which should
happen soon, because user_header->state is INACTIVE and the user has to enter
the kernel to sleep).  So shrink happens only on the epoll_wait() call with
all necessary locks taken.

The bit put has to be postponed because the user can observe a corrupted
event item if events are not yet consumed from the ring, the bit is put and
then immediately reused by a concurrent item insert.  To avoid this possible
race the bit put is postponed while the header state is INACTIVE and all
events are routed to kernel lists.
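
The idea of the postponed bit put can be sketched with two bitmaps, one for
bits in use and one for bits waiting to be released.  This is a simplified
userspace illustration with hypothetical sk_* names and a single 64-bit word;
unlike the series, it defers every release until an explicit flush, which
stands in for the epoll_wait() call:

```c
#include <assert.h>
#include <stdint.h>

struct sk_bitmap {
    uint64_t used;      /* bits owned by live or not yet released items */
    uint64_t removed;   /* bits whose release is postponed */
};

/* Returns the lowest free bit number, or -1 if all 64 bits are taken. */
static int sk_get_bit(struct sk_bitmap *bm)
{
    for (int bit = 0; bit < 64; bit++) {
        if (!(bm->used & (UINT64_C(1) << bit))) {
            bm->used |= UINT64_C(1) << bit;
            return bit;
        }
    }
    return -1;
}

/* Item removal: only record the bit, it stays reserved for now. */
static void sk_put_bit_postponed(struct sk_bitmap *bm, int bit)
{
    bm->removed |= UINT64_C(1) << bit;
}

/* Stand-in for the epoll_wait() point: release all recorded bits. */
static void sk_flush_removed(struct sk_bitmap *bm)
{
    bm->used &= ~bm->removed;
    bm->removed = 0;
}
```

Until sk_flush_removed() runs, a removed bit cannot be handed out again, so
a consumer still holding an old ring index never sees it reused by a
concurrent insert.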

So returning to the question: how does userspace detect that the kernel has
changed the memory?  The user has to cache the lengths before epoll_wait(),
compare the cached values with the new ones from the header and call mremap()
if the values differ:

 header_length = header->header_length;
 index_length = header->index_length;

 rc = epoll_wait(epfd, NULL, 0, -1);
 assert(rc < 0);
 /* errno holds a positive error code */
 if (errno != ESTALE)
     return -errno;

 if (header_length != header->header_length) {
     header = mremap(header, header_length, header->header_length, MREMAP_MAYMOVE);
     assert(header != MAP_FAILED);
 }
 if (index_length != header->index_length) {
     index = mremap(index, index_length, header->index_length, MREMAP_MAYMOVE);
     assert(index != MAP_FAILED);
 }

* Is it possible to consume events from many threads on userspace side?

That should be possible in a lockless manner, and kernel keeps extra number
of free slots in a ring (EPOLL_USER_EXTRA_INDEX_NR = 16) in order to let
user consume events from up to 16 threads in parallel.

It seems that this can be a good feature performance-wise, but I could not
decide whether it is enough to report this value in the user header or to let
the user change it somehow on the epoll_create1() call (or a new one?).

* Is there any testing app available?

There is a small app [1] which starts many threads with many event fds and
produces many events, while single consumer fetches them from userspace
and goes to kernel from time to time in order to wait.


This is an RFC because for memory allocation I used vmalloc(), whose virtual
address space on the kernel side seems limited on some archs.  For example,
for 1 million items the kernel has to allocate 10^6 x 16 [items] +
10^6 x 4 [index], that is around 20 MB; it seems very small, but I am not
sure whether it is OK or not.

I temporarily used gcc atomic builtins on the kernel side, because I did not
find any good way to atomically update a plain unsigned int of the
user_header structure without casting it to atomic_t.  Or is casting fine in
that case?

There are not enough good, informative and shiny comments in the code
explaining all the machinery.  The hardest part is left, I would say.

Only very basic scenarios are tested; all these things with user
reallocations (expand, shrink) are not tested at all.

[1] https://github.com/rouming/test-tools/blob/master/userpolled-epoll.c

Roman Penyaev (15):
  mm/vmalloc: add new 'alignment' field for vm_struct structure
  mm/vmalloc: move common logic from  __vmalloc_area_node to a separate
    func
  mm/vmalloc: introduce new vrealloc() call and its subsidiary reach
    analog
  epoll: move private helpers from a header to the source
  epoll: introduce user header structure and user index for polling from
    userspace
  epoll: introduce various of helpers for user structure lengths
    calculations
  epoll: extend epitem struct with new members for polling from
    userspace
  epoll: some sanity flags checks for epoll syscalls for polled epfd
    from userspace
  epoll: introduce stand-alone helpers for polling from userspace
  epoll: support polling from userspace for ep_insert()
  epoll: offload polling to a work in case of epfd polled from userspace
  epoll: support polling from userspace for ep_remove()
  epoll: support polling from userspace for ep_modify()
  epoll: support polling from userspace for ep_poll()
  epoll: support mapping for epfd when polled from userspace

 fs/eventpoll.c                 | 1042 +++++++++++++++++++++++++++++---
 include/linux/vmalloc.h        |    4 +
 include/uapi/linux/eventpoll.h |   15 +-
 mm/vmalloc.c                   |  152 ++++-
 4 files changed, 1117 insertions(+), 96 deletions(-)

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Joe Perches <joe@perches.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
-- 
2.19.1



* [RFC PATCH 01/15] mm/vmalloc: add new 'alignment' field for vm_struct structure
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 02/15] mm/vmalloc: move common logic from __vmalloc_area_node to a separate func Roman Penyaev
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Michal Hocko, Andrey Ryabinin,
	Joe Perches, Luis R. Rodriguez, linux-mm, linux-kernel

I need a new alignment field for the vm area in order to reallocate a
previously allocated area with the same alignment.

A patch for a new vrealloc() call will follow; I want to keep this new call
as simple as possible and thus not provide dozens of variants, like
vrealloc_user(), which care about alignment.

The current changes are just preparations.

It is worth mentioning that on archs where unsigned long is 64 bit
this new field does not bloat vm_struct, because originally
there was padding between nr_pages and phys_addr.
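
The padding claim can be checked with a small userspace sketch.  The sk_*
structures below are illustrative stand-ins that mirror only the vm_struct
fields visible in this patch, and sk_phys_addr_t assumes a 64-bit
phys_addr_t; the result holds for LP64 targets only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sk_phys_addr_t;   /* stand-in for a 64-bit phys_addr_t */

/* Field order before the patch. */
struct sk_vm_before {
    unsigned long flags;
    void **pages;
    unsigned int nr_pages;
    /* 4 bytes of padding here when phys_addr is 8-byte aligned */
    sk_phys_addr_t phys_addr;
    const void *caller;
};

/* Field order after the patch: 'alignment' takes the padding slot. */
struct sk_vm_after {
    unsigned long flags;
    void **pages;
    unsigned int nr_pages;
    unsigned int alignment;
    sk_phys_addr_t phys_addr;
    const void *caller;
};

/* True if the new field neither grows the struct nor moves phys_addr. */
static bool sk_alignment_fits_padding(void)
{
    return sizeof(struct sk_vm_before) == sizeof(struct sk_vm_after) &&
           offsetof(struct sk_vm_before, phys_addr) ==
           offsetof(struct sk_vm_after, phys_addr);
}
```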

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/vmalloc.h |  1 +
 mm/vmalloc.c            | 10 ++++++----
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd61..78210aa0bb43 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -38,6 +38,7 @@ struct vm_struct {
 	unsigned long		flags;
 	struct page		**pages;
 	unsigned int		nr_pages;
+	unsigned int		alignment;
 	phys_addr_t		phys_addr;
 	const void		*caller;
 };
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e83961767dc1..4851b4a67f55 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1347,12 +1347,14 @@ int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages)
 EXPORT_SYMBOL_GPL(map_vm_area);
 
 static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
-			      unsigned long flags, const void *caller)
+			     unsigned int align, unsigned long flags,
+			     const void *caller)
 {
 	spin_lock(&vmap_area_lock);
 	vm->flags = flags;
 	vm->addr = (void *)va->va_start;
 	vm->size = va->va_end - va->va_start;
+	vm->alignment = align;
 	vm->caller = caller;
 	va->vm = vm;
 	va->flags |= VM_VM_AREA;
@@ -1399,7 +1401,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 		return NULL;
 	}
 
-	setup_vmalloc_vm(area, va, flags, caller);
+	setup_vmalloc_vm(area, va, align, flags, caller);
 
 	return area;
 }
@@ -2601,8 +2603,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 
 	/* insert all vm's */
 	for (area = 0; area < nr_vms; area++)
-		setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC,
-				 pcpu_get_vm_areas);
+		setup_vmalloc_vm(vms[area], vas[area], align,
+				 VM_ALLOC, pcpu_get_vm_areas);
 
 	kfree(vas);
 	return vms;
-- 
2.19.1



* [RFC PATCH 02/15] mm/vmalloc: move common logic from  __vmalloc_area_node to a separate func
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 01/15] mm/vmalloc: add new 'alignment' field for vm_struct structure Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog Roman Penyaev
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Michal Hocko, Andrey Ryabinin,
	Joe Perches, Luis R. Rodriguez, linux-mm, linux-kernel

This one moves the logic related to pages array creation to a separate
function, which will also be used by the vrealloc() call, whose
implementation will follow.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/vmalloc.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4851b4a67f55..ad6cd807f6db 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1662,21 +1662,26 @@ EXPORT_SYMBOL(vmap);
 static void *__vmalloc_node(unsigned long size, unsigned long align,
 			    gfp_t gfp_mask, pgprot_t prot,
 			    int node, const void *caller);
-static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
-				 pgprot_t prot, int node)
+
+static int alloc_vm_area_array(struct vm_struct *area, gfp_t gfp_mask, int node)
 {
+	unsigned int nr_pages, array_size;
 	struct page **pages;
-	unsigned int nr_pages, array_size, i;
+
 	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
-	const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
 	const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?
 					0 :
 					__GFP_HIGHMEM;
 
+	if (WARN_ON(area->pages))
+		return -EINVAL;
+
 	nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
+	if (!nr_pages)
+		return -EINVAL;
+
 	array_size = (nr_pages * sizeof(struct page *));
 
-	area->nr_pages = nr_pages;
 	/* Please note that the recursion is strictly bounded. */
 	if (array_size > PAGE_SIZE) {
 		pages = __vmalloc_node(array_size, 1, nested_gfp|highmem_mask,
@@ -1684,8 +1689,25 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	} else {
 		pages = kmalloc_node(array_size, nested_gfp, node);
 	}
+	if (!pages)
+		return -ENOMEM;
+
+	area->nr_pages = nr_pages;
 	area->pages = pages;
-	if (!area->pages) {
+
+	return 0;
+}
+
+static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
+				 pgprot_t prot, int node)
+{
+	const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
+	const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?
+					0 :
+					__GFP_HIGHMEM;
+	unsigned int i;
+
+	if (alloc_vm_area_array(area, gfp_mask, node)) {
 		remove_vm_area(area->addr);
 		kfree(area);
 		return NULL;
@@ -1709,7 +1731,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 			cond_resched();
 	}
 
-	if (map_vm_area(area, prot, pages))
+	if (map_vm_area(area, prot, area->pages))
 		goto fail;
 	return area->addr;
 
-- 
2.19.1



* [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 01/15] mm/vmalloc: add new 'alignment' field for vm_struct structure Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 02/15] mm/vmalloc: move common logic from __vmalloc_area_node to a separate func Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:50   ` Matthew Wilcox
  2019-01-09 16:40 ` [RFC PATCH 04/15] epoll: move private helpers from a header to the source Roman Penyaev
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Michal Hocko, Andrey Ryabinin,
	Joe Perches, Luis R. Rodriguez, linux-mm, linux-kernel

vrealloc() changes the size of virtually contiguous memory previously
allocated by vmalloc().

vrealloc() under the hood does the following:

 1. allocates a new vm area based on the alignment of the old one.
 2. allocates the pages array for the new vm area.
 3. fills in the ->pages array, taking pages from the old area and
    increasing the page ref.

    In case of virtual size growth (old_size < new_size) new pages
    for the new area are allocated using the gfp passed by the caller.

Basically vrealloc() repeats glibc realloc() with only one big difference:
the old area is not freed, i.e. the caller is responsible for calling vfree()
in case of successful reallocation.

Why is vfree() not called for the old area directly from vrealloc()?  Because
sometimes it is better to have a transaction-like reallocation for
several pointers and reallocate all at once, i.e.:

  new_p1 = vrealloc(p1, new_len);
  new_p2 = vrealloc(p2, new_len);
  if (!new_p1 || !new_p2) {
	vfree(new_p1);
	vfree(new_p2);
	return -ENOMEM;
  }

  vfree(p1);
  vfree(p2);

  p1 = new_p1;
  p2 = new_p2;
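
The same all-or-nothing pattern can be tried out in userspace.  The sketch
below is only an analogy built on calloc() and free(): sk_realloc_keep_old()
and sk_resize_pair() are hypothetical names, and the helper copies the data
instead of sharing pages as vrealloc() does; new_len is assumed to be
non-zero:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Analogy of vrealloc(): returns a new zero-filled buffer holding the old
 * contents.  The old buffer is left untouched and is still owned by the
 * caller.
 */
static void *sk_realloc_keep_old(const void *old, size_t old_len,
                                 size_t new_len)
{
    void *p = calloc(1, new_len);

    if (p)
        memcpy(p, old, old_len < new_len ? old_len : new_len);
    return p;
}

/* Resize both buffers or neither of them. */
static int sk_resize_pair(void **p1, void **p2, size_t old_len,
                          size_t new_len)
{
    void *n1 = sk_realloc_keep_old(*p1, old_len, new_len);
    void *n2 = sk_realloc_keep_old(*p2, old_len, new_len);

    if (!n1 || !n2) {
        /* Roll back: the old buffers are still valid, free(NULL) is a no-op */
        free(n1);
        free(n2);
        return -ENOMEM;
    }

    /* Commit: drop the old buffers only after both allocations succeeded */
    free(*p1);
    free(*p2);
    *p1 = n1;
    *p2 = n2;
    return 0;
}
```

Because the old buffers are released only at the commit step, a failure of
either allocation leaves the caller with both original pointers intact.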

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/vmalloc.h |   3 ++
 mm/vmalloc.c            | 106 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 78210aa0bb43..2902faf26c4f 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -72,6 +72,7 @@ static inline void vmalloc_init(void)
 
 extern void *vmalloc(unsigned long size);
 extern void *vzalloc(unsigned long size);
+extern void *vrealloc(void *old_addr, unsigned long size);
 extern void *vmalloc_user(unsigned long size);
 extern void *vmalloc_node(unsigned long size, int node);
 extern void *vzalloc_node(unsigned long size, int node);
@@ -83,6 +84,8 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			unsigned long start, unsigned long end, gfp_t gfp_mask,
 			pgprot_t prot, unsigned long vm_flags, int node,
 			const void *caller);
+extern void *__vrealloc_node(void *old_addr, unsigned long size, gfp_t gfp_mask,
+			     pgprot_t prot, int node, const void *caller);
 #ifndef CONFIG_MMU
 extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags);
 static inline void *__vmalloc_node_flags_caller(unsigned long size, int node,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ad6cd807f6db..94cc99e780c7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1889,6 +1889,112 @@ void *vzalloc(unsigned long size)
 }
 EXPORT_SYMBOL(vzalloc);
 
+void *__vrealloc_node(void *old_addr, unsigned long size, gfp_t gfp_mask,
+		      pgprot_t prot, int node, const void *caller)
+{
+	const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
+	const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?	0 :
+					__GFP_HIGHMEM;
+	struct vm_struct *old_area, *area;
+	struct page *page;
+
+	unsigned int i;
+
+	old_area = find_vm_area(old_addr);
+	if (!old_area)
+		return NULL;
+
+	if (!(old_area->flags & VM_ALLOC))
+		return NULL;
+
+	size = PAGE_ALIGN(size);
+	if (!size || (size >> PAGE_SHIFT) > totalram_pages())
+		return NULL;
+
+	if (get_vm_area_size(old_area) == size)
+		return old_addr;
+
+	area = __get_vm_area_node(size, old_area->alignment, VM_UNINITIALIZED |
+				  old_area->flags, VMALLOC_START, VMALLOC_END,
+				  node, gfp_mask, caller);
+	if (!area)
+		return NULL;
+
+	if (alloc_vm_area_array(area, gfp_mask, node)) {
+		__vunmap(area->addr, 0);
+		return NULL;
+	}
+
+	for (i = 0; i < area->nr_pages; i++) {
+		if (i < old_area->nr_pages) {
+			/* Take a page from old area and increase a ref */
+
+			page = old_area->pages[i];
+			area->pages[i] = page;
+			get_page(page);
+		} else {
+			/* Allocate more pages in case of grow */
+
+			page = alloc_page(alloc_mask|highmem_mask);
+			if (unlikely(!page)) {
+				/*
+				 * Successfully allocated i pages, free
+				 * them in __vunmap()
+				 */
+				area->nr_pages = i;
+				goto fail;
+			}
+
+			area->pages[i] = page;
+			if (gfpflags_allow_blocking(gfp_mask|highmem_mask))
+				cond_resched();
+		}
+	}
+	if (map_vm_area(area, prot, area->pages))
+		goto fail;
+
+	/* New area is fully ready */
+	clear_vm_uninitialized_flag(area);
+	kmemleak_vmalloc(area, size, gfp_mask);
+
+	return area->addr;
+
+fail:
+	warn_alloc(gfp_mask, NULL, "vrealloc: allocation failure");
+	__vfree(area->addr);
+
+	return NULL;
+}
+EXPORT_SYMBOL(__vrealloc_node);
+
+/**
+ *	vrealloc - reallocate virtually contiguous memory with zero fill
+ *	@old_addr:	old virtual address
+ *	@size:		new size
+ *
+ *	Allocate additional pages to cover new @size from the page level
+ *	allocator if memory grows. Then pages are mapped into a new
+ *	contiguous kernel virtual space, previous area is NOT freed.
+ *
+ *	Do not forget to call vfree() passing old address.  But careful,
+ *	calling vfree() from interrupt will cause vfree_deferred() call,
+ *	which in its turn uses freed address as a temporal pointer for a
+ *	llist element, i.e. memory will be corrupted.
+ *
+ *	If new size is equal to the old size - old pointer is returned.
+ *	I.e. appropriate check should be made before calling vfree().
+ *
+ *	For tight control over page level allocator and protection flags
+ *	use __vrealloc_node() instead.
+ */
+void *vrealloc(void *old_addr, unsigned long size)
+{
+	return __vrealloc_node(old_addr, size, GFP_KERNEL | __GFP_ZERO,
+			       PAGE_KERNEL, NUMA_NO_NODE,
+			       __builtin_return_address(0));
+}
+EXPORT_SYMBOL(vrealloc);
+
 /**
  * vmalloc_user - allocate zeroed virtually contiguous memory for userspace
  * @size: allocation size
-- 
2.19.1



* [RFC PATCH 04/15] epoll: move private helpers from a header to the source
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (2 preceding siblings ...)
  2019-01-09 16:40 ` [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 05/15] epoll: introduce user header structure and user index for polling from userspace Roman Penyaev
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Davidlohr Bueso, Jason Baron,
	Al Viro, Paul E. McKenney, Linus Torvalds, Andrea Parri,
	linux-fsdevel, linux-kernel

Those helpers will access private eventpoll structure in future patches,
so keep those helpers close to callers.

Nothing important here.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c                 | 13 +++++++++++++
 include/uapi/linux/eventpoll.h | 12 ------------
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4a0e98d87fcc..2cc183e86a29 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -466,6 +466,19 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi)
 
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
+#ifdef CONFIG_PM_SLEEP
+static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+{
+	if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND))
+		epev->events &= ~EPOLLWAKEUP;
+}
+#else
+static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+{
+	epev->events &= ~EPOLLWAKEUP;
+}
+#endif /* CONFIG_PM_SLEEP */
+
 /**
  * ep_call_nested - Perform a bound (possibly) nested call, by checking
  *                  that the recursion limit is not exceeded, and that
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 8a3432d0f0dc..39dfc29f0f52 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -79,16 +79,4 @@ struct epoll_event {
 	__u64 data;
 } EPOLL_PACKED;
 
-#ifdef CONFIG_PM_SLEEP
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
-{
-	if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND))
-		epev->events &= ~EPOLLWAKEUP;
-}
-#else
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
-{
-	epev->events &= ~EPOLLWAKEUP;
-}
-#endif
 #endif /* _UAPI_LINUX_EVENTPOLL_H */
-- 
2.19.1



* [RFC PATCH 05/15] epoll: introduce user header structure and user index for polling from userspace
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (3 preceding siblings ...)
  2019-01-09 16:40 ` [RFC PATCH 04/15] epoll: move private helpers from a header to the source Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 06/15] epoll: introduce various of helpers for user structure lengths calculations Roman Penyaev
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Davidlohr Bueso, Jason Baron,
	Al Viro, Paul E. McKenney, Linus Torvalds, Andrea Parri,
	linux-fsdevel, linux-kernel

This one introduces the main user structures: user header and user index.
The header describes the current state of epoll, the head and tail of the
index ring, and carries the epoll items at the end of the structure.

The index table is a ring, which is controlled by head and tail from the
user header.  The ring consists of u32 indices pointing to items in the
header which are ready for polling.

Userspace has to call epoll_create1(EPOLL_USERPOLL) in order to start
using polling from user side.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c                 | 107 ++++++++++++++++++++++++++++++++-
 include/uapi/linux/eventpoll.h |   3 +-
 2 files changed, 106 insertions(+), 4 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2cc183e86a29..9ec682b6488f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -178,6 +178,42 @@ struct epitem {
 	struct epoll_event event;
 };
 
+#define EPOLL_USER_HEADER_SIZE  128
+#define EPOLL_USER_HEADER_MAGIC 0xeb01eb01
+
+enum {
+	EPOLL_USER_POLL_INACTIVE = 0, /* user poll disactivated */
+	EPOLL_USER_POLL_ACTIVE   = 1, /* user can continue busy polling */
+
+	/*
+	 * Always keep some slots ahead to be able to consume new events
+	 * from many threads, i.e. if N threads consume ring from userspace,
+	 * we have to keep N free slots ahead to avoid ring overlap.
+	 *
+	 * Probably this number should be reported to userspace in header.
+	 */
+	EPOLL_USER_EXTRA_INDEX_NR = 16 /* how many extra indeces keep in ring */
+};
+
+struct user_epitem {
+	unsigned int ready_events;
+	struct epoll_event event;
+};
+
+struct user_header {
+	unsigned int magic;          /* epoll user header magic */
+	unsigned int state;          /* epoll ring state */
+	unsigned int header_length;  /* length of the header + items */
+	unsigned int index_length;   /* length of the index ring */
+	unsigned int max_items_nr;   /* max num of items slots */
+	unsigned int max_index_nr;   /* max num of items indices, always pow2 */
+	unsigned int head;           /* updated by userland */
+	unsigned int tail;           /* updated by kernel */
+	unsigned int padding[24];    /* Header size is 128 bytes */
+
+	struct user_epitem items[];
+};
+
 /*
  * This structure is stored inside the "private_data" member of the file
  * structure and represents the main data structure for the eventpoll
@@ -222,6 +258,36 @@ struct eventpoll {
 
 	struct file *file;
 
+	/* User header with array of items */
+	struct user_header *user_header;
+
+	/* User index, which acts as a ring of coming events */
+	unsigned int *user_index;
+
+	/* Actual length of user header, always aligned on page */
+	unsigned int header_length;
+
+	/* Actual length of user index, always aligned on page */
+	unsigned int index_length;
+
+	/* Number of event items */
+	unsigned int items_nr;
+
+	/* Items bitmap, is used to get a free bit for new registered epi */
+	unsigned long *items_bm;
+
+	/* Removed items bitmap, is used to postpone bit put */
+	unsigned long *removed_items_bm;
+
+	/* Length of both items bitmaps, always aligned on page */
+	unsigned int items_bm_length;
+
+	/*
+	 * Where events are routed: to kernel lists or to user ring.
+	 * Always false for epfd created without EPOLL_USERPOLL.
+	 */
+	bool events_to_uring;
+
 	/* used to optimize loop detection check */
 	int visited;
 	struct list_head visited_list_link;
@@ -876,6 +942,10 @@ static void ep_free(struct eventpoll *ep)
 	mutex_destroy(&ep->mtx);
 	free_uid(ep->user);
 	wakeup_source_unregister(ep->ws);
+	vfree(ep->user_header);
+	vfree(ep->user_index);
+	vfree(ep->items_bm);
+	vfree(ep->removed_items_bm);
 	kfree(ep);
 }
 
@@ -1028,7 +1098,7 @@ void eventpoll_release_file(struct file *file)
 	mutex_unlock(&epmutex);
 }
 
-static int ep_alloc(struct eventpoll **pep)
+static int ep_alloc(struct eventpoll **pep, int flags)
 {
 	int error;
 	struct user_struct *user;
@@ -1040,6 +1110,31 @@ static int ep_alloc(struct eventpoll **pep)
 	if (unlikely(!ep))
 		goto free_uid;
 
+	if (flags & EPOLL_USERPOLL) {
+		ep->user_header = vmalloc_user(PAGE_SIZE);
+		ep->user_index = vmalloc_user(PAGE_SIZE);
+		ep->items_bm = vzalloc(PAGE_SIZE);
+		ep->removed_items_bm = vzalloc(PAGE_SIZE);
+		ep->events_to_uring = true;
+		if (!ep->user_header || !ep->user_index)
+			goto free_ep;
+		if (!ep->items_bm || !ep->removed_items_bm)
+			goto free_ep;
+
+		ep->header_length   = PAGE_SIZE;
+		ep->index_length    = PAGE_SIZE;
+		ep->items_bm_length = PAGE_SIZE;
+
+		*ep->user_header = (typeof(*ep->user_header)) {
+			.magic         = EPOLL_USER_HEADER_MAGIC,
+			.state         = EPOLL_USER_POLL_ACTIVE,
+			.header_length = ep->header_length,
+			.index_length  = ep->index_length,
+			.max_items_nr  = ep_max_items_nr(ep),
+			.max_index_nr  = ep_max_index_nr(ep),
+		};
+	}
+
 	mutex_init(&ep->mtx);
 	rwlock_init(&ep->lock);
 	init_waitqueue_head(&ep->wq);
@@ -1053,6 +1148,12 @@ static int ep_alloc(struct eventpoll **pep)
 
 	return 0;
 
+free_ep:
+	vfree(ep->user_header);
+	vfree(ep->user_index);
+	vfree(ep->items_bm);
+	vfree(ep->removed_items_bm);
+	kfree(ep);
 free_uid:
 	free_uid(user);
 	return error;
@@ -2066,12 +2167,12 @@ static int do_epoll_create(int flags)
 	/* Check the EPOLL_* constant for consistency.  */
 	BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
 
-	if (flags & ~EPOLL_CLOEXEC)
+	if (flags & ~(EPOLL_CLOEXEC | EPOLL_USERPOLL))
 		return -EINVAL;
 	/*
 	 * Create the internal data structure ("struct eventpoll").
 	 */
-	error = ep_alloc(&ep);
+	error = ep_alloc(&ep, flags);
 	if (error < 0)
 		return error;
 	/*
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 39dfc29f0f52..b0a565f6c6c3 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -20,7 +20,8 @@
 #include <linux/types.h>
 
 /* Flags for epoll_create1.  */
-#define EPOLL_CLOEXEC O_CLOEXEC
+#define EPOLL_CLOEXEC  O_CLOEXEC
+#define EPOLL_USERPOLL 1
 
 /* Valid opcodes to issue to sys_epoll_ctl() */
 #define EPOLL_CTL_ADD 1
-- 
2.19.1



* [RFC PATCH 06/15] epoll: introduce various of helpers for user structure lengths calculations
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (4 preceding siblings ...)
  2019-01-09 16:40 ` [RFC PATCH 05/15] epoll: introduce user header structure and user index for polling from userspace Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 07/15] epoll: extend epitem struct with new members for polling from userspace Roman Penyaev
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Davidlohr Bueso, Jason Baron,
	Al Viro, Paul E. McKenney, Linus Torvalds, Andrea Parri,
	linux-fsdevel, linux-kernel

Helpers for lengths-from-number and number-from-lengths calculations.
Among them:

  ep_polled_by_user()
	- returns true if epoll was created with EPOLL_USERPOLL

  ep_user_ring_events_available()
	- returns true if there is something in user ring buffer
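
A rough sketch of the length/number conversions, for illustration only.
The sk_ names and the constants are assumptions (4096-byte pages and a
16-byte user item); only the 128-byte header size comes from the series.
Plain multiplication and division are used here; the ilog2()-based
shifts in the patch give the same results only when the element sizes
are powers of two.

```c
/* Assumed sizes for illustration only. */
enum {
	SK_PAGE_SIZE   = 4096, /* assumed page size */
	SK_HEADER_SIZE = 128,  /* EPOLL_USER_HEADER_SIZE in the series */
	SK_ITEM_SIZE   = 16,   /* assumed sizeof(struct user_epitem) */
	SK_INDEX_SIZE  = 4     /* size of one u32 ring slot */
};

/* Bytes needed for the header plus nr items */
static inline unsigned int sk_to_items_length(unsigned int nr)
{
	return SK_HEADER_SIZE + nr * SK_ITEM_SIZE;
}

/* Number of items that fit into len bytes of header memory */
static inline unsigned int sk_to_items_nr(unsigned int len)
{
	return (len - SK_HEADER_SIZE) / SK_ITEM_SIZE;
}

/* Bytes needed for nr ring slots */
static inline unsigned int sk_to_index_length(unsigned int nr)
{
	return nr * SK_INDEX_SIZE;
}

/* Number of ring slots that fit into len bytes */
static inline unsigned int sk_to_index_nr(unsigned int len)
{
	return len / SK_INDEX_SIZE;
}

/* Bytes needed for a bitmap of nr bits, rounded up to whole bytes */
static inline unsigned int sk_to_items_bm_length(unsigned int nr)
{
	return (nr + 7) / 8;
}

/* Number of bits held by a bitmap of len bytes */
static inline unsigned int sk_to_items_bm_nr(unsigned int len)
{
	return len * 8;
}
```

Under these assumptions a single page of header memory holds 248 items
and a single page of index memory holds 1024 ring slots.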

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 9ec682b6488f..ae288f62aa4c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -438,6 +438,65 @@ static void ep_nested_calls_init(struct nested_calls *ncalls)
 	spin_lock_init(&ncalls->lock);
 }
 
+static inline unsigned int to_items_length(unsigned int nr)
+{
+	struct eventpoll *ep;
+
+	return (sizeof(*ep->user_header) +
+		(nr << ilog2(sizeof(ep->user_header->items[0]))));
+}
+
+static inline unsigned int to_index_length(unsigned int nr)
+{
+	struct eventpoll *ep;
+
+	return nr << ilog2(sizeof(*ep->user_index));
+}
+
+static inline unsigned int to_items_bm_length(unsigned int nr)
+{
+	return ALIGN(nr, 8) >> 3;
+}
+
+static inline unsigned int to_items_nr(unsigned int len)
+{
+	struct eventpoll *ep;
+
+	return (len - sizeof(*ep->user_header)) >>
+		ilog2(sizeof(ep->user_header->items[0]));
+}
+
+static inline unsigned int to_items_bm_nr(unsigned int len)
+{
+	return len << 3;
+}
+
+static inline unsigned int ep_max_items_nr(struct eventpoll *ep)
+{
+	return to_items_nr(ep->header_length);
+}
+
+static inline unsigned int ep_max_index_nr(struct eventpoll *ep)
+{
+	return ep->index_length >> ilog2(sizeof(*ep->user_index));
+}
+
+static inline unsigned int ep_max_items_bm_nr(struct eventpoll *ep)
+{
+	return to_items_bm_nr(ep->items_bm_length);
+}
+
+static inline bool ep_polled_by_user(struct eventpoll *ep)
+{
+	return !!ep->user_header;
+}
+
+static inline bool ep_user_ring_events_available(struct eventpoll *ep)
+{
+	return ep_polled_by_user(ep) &&
+		ep->user_header->head != ep->user_header->tail;
+}
+
 /**
  * ep_events_available - Checks if ready events might be available.
  *
-- 
2.19.1



* [RFC PATCH 07/15] epoll: extend epitem struct with new members for polling from userspace
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (5 preceding siblings ...)
  2019-01-09 16:40 ` [RFC PATCH 06/15] epoll: introduce various of helpers for user structure lengths calculations Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 08/15] epoll: some sanity flags checks for epoll syscalls for polled epfd " Roman Penyaev
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Davidlohr Bueso, Jason Baron,
	Al Viro, Paul E. McKenney, Linus Torvalds, Andrea Parri,
	linux-fsdevel, linux-kernel

 ->bit
       every epitem has an element inside the user item array; this bit
       is the index of that element in the array and also the position
       of its bit inside ep->items_bm

 ->ready_events
       events received during the period when the descriptor can't be
       polled from userspace and ep->rdllist is used for keeping the
       list of ready items

 ->work
       work for offloading polling from task context if epfd is polled
       from userspace but the driver does not provide pollflags on wakeup

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index ae288f62aa4c..637b463587c1 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -9,6 +9,8 @@
  *
  *  Davide Libenzi <davidel@xmailserver.org>
  *
+ *  Polling from userspace support by Roman Penyaev <rpenyaev@suse.de>
+ *  (C) Copyright 2019 SUSE, All Rights Reserved
  */
 
 #include <linux/init.h>
@@ -42,6 +44,7 @@
 #include <linux/seq_file.h>
 #include <linux/compat.h>
 #include <linux/rculist.h>
+#include <linux/workqueue.h>
 #include <net/busy_poll.h>
 
 /*
@@ -176,6 +179,18 @@ struct epitem {
 
 	/* The structure that describe the interested events and the source fd */
 	struct epoll_event event;
+
+	/* Bit in user bitmap for user polling */
+	unsigned int bit;
+
+	/*
+	 * Collect ready events for the period when descriptor is polled by user
+	 * but events are routed to klists.
+	 */
+	__poll_t ready_events;
+
+	/* Work for offloading event callback */
+	struct work_struct work;
 };
 
 #define EPOLL_USER_HEADER_SIZE  128
@@ -2557,12 +2572,6 @@ static int __init eventpoll_init(void)
 	ep_nested_calls_init(&poll_safewake_ncalls);
 #endif
 
-	/*
-	 * We can have many thousands of epitems, so prevent this from
-	 * using an extra cache line on 64-bit (and smaller) CPUs
-	 */
-	BUILD_BUG_ON(sizeof(void *) <= 8 && sizeof(struct epitem) > 128);
-
 	/* Allocates slab cache used to allocate "struct epitem" items */
 	epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
 			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
-- 
2.19.1



* [RFC PATCH 08/15] epoll: some sanity flags checks for epoll syscalls for polled epfd from userspace
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (6 preceding siblings ...)
  2019-01-09 16:40 ` [RFC PATCH 07/15] epoll: extend epitem struct with new members for polling from userspace Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 09/15] epoll: introduce stand-alone helpers for polling " Roman Penyaev
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Davidlohr Bueso, Jason Baron,
	Al Viro, Paul E. McKenney, Linus Torvalds, Andrea Parri,
	linux-fsdevel, linux-kernel

There are various limitations if epfd is polled by user:

 1. The EPOLLET flag is always expected (Edge Triggered behavior)

 2. No support for EPOLLWAKEUP
       events are consumed from userspace, thus no way to call __pm_relax()

 3. No support for EPOLLEXCLUSIVE
       If device does not pass pollflags to wake_up() there is no way to
       call poll() from the context under spinlock, thus special work is
       scheduled to offload polling.  In this specific case we can't
       support exclusive wakeups, because we do not know actual result
       of scheduled work.

 4. No support for nesting of epoll descriptors polled from userspace:
       no real good reason to scan ready events of user ring from the
       kernel, so just do not do that.
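
The checks above can be summarised by a small stand-alone sketch.  The
function and the SK_ names are hypothetical; the bit values mirror the
uapi EPOLL* flags, and the -EINVAL return value is an assumption based
on the surrounding epoll_ctl() code.  Other checks done by epoll_ctl()
(capabilities, EPOLL_CTL_MOD rules) are left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Values mirror the uapi EPOLL* bits; SK_ names are local to the sketch */
#define SK_EPOLLEXCLUSIVE (1u << 28)
#define SK_EPOLLWAKEUP    (1u << 29)
#define SK_EPOLLET        (1u << 31)

/*
 * Validate and normalise an event mask for an add/modify operation.
 * Returns 0 on success or a negative errno; may clear bits in *events.
 */
static inline int sk_check_events(bool polled_by_user,
				  bool target_is_user_polled_epoll,
				  uint32_t *events)
{
	if (target_is_user_polled_epoll)
		return -EINVAL;  /* limitation 4: no nesting */
	if (!polled_by_user)
		return 0;        /* ordinary epfd: rules below do not apply */
	if (*events & SK_EPOLLEXCLUSIVE)
		return -EINVAL;  /* limitation 3: no exclusive wakeups */
	if (!(*events & SK_EPOLLET))
		return -EINVAL;  /* limitation 1: edge triggered only */

	/* limitation 2: EPOLLWAKEUP is dropped silently, not rejected */
	*events &= ~SK_EPOLLWAKEUP;
	return 0;
}
```

Note the asymmetry: EPOLLWAKEUP is cleared from the mask, while a
missing EPOLLET or a set EPOLLEXCLUSIVE fails the call.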

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 78 ++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 56 insertions(+), 22 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 637b463587c1..bdaec59a847e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -607,13 +607,17 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi)
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
 #ifdef CONFIG_PM_SLEEP
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+static inline void ep_take_care_of_epollwakeup(struct eventpoll *ep,
+					       struct epoll_event *epev)
 {
-	if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND))
-		epev->events &= ~EPOLLWAKEUP;
+	if (epev->events & EPOLLWAKEUP) {
+		if (!capable(CAP_BLOCK_SUSPEND) || ep_polled_by_user(ep))
+			epev->events &= ~EPOLLWAKEUP;
+	}
 }
 #else
-static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
+static inline void ep_take_care_of_epollwakeup(struct eventpoll *ep,
+					       struct epoll_event *epev)
 {
 	epev->events &= ~EPOLLWAKEUP;
 }
@@ -1054,6 +1058,7 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
 		return vfs_poll(epi->ffd.file, pt) & epi->event.events;
 
 	ep = epi->ffd.file->private_data;
+	WARN_ON(ep_polled_by_user(ep));
 	poll_wait(epi->ffd.file, &ep->poll_wait, pt);
 	locked = pt && (pt->_qproc == ep_ptable_queue_proc);
 
@@ -1094,6 +1099,13 @@ static __poll_t ep_eventpoll_poll(struct file *file, poll_table *wait)
 	struct eventpoll *ep = file->private_data;
 	int depth = 0;
 
+	if (ep_polled_by_user(ep))
+		/*
+		 * We do not support polling of descriptor which is polled
+		 * by user.
+		 */
+		return 0;
+
 	/* Insert inside our poll wait queue */
 	poll_wait(file, &ep->poll_wait, wait);
 
@@ -2324,10 +2336,6 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (!file_can_poll(tf.file))
 		goto error_tgt_fput;
 
-	/* Check if EPOLLWAKEUP is allowed */
-	if (ep_op_has_event(op))
-		ep_take_care_of_epollwakeup(&epds);
-
 	/*
 	 * We have to check that the file structure underneath the file descriptor
 	 * the user passed to us _is_ an eventpoll file. And also we do not permit
@@ -2337,10 +2345,25 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (f.file == tf.file || !is_file_epoll(f.file))
 		goto error_tgt_fput;
 
+	/*
+	 * Do not support scanning of ready events of epoll, which is pollable
+	 * by userspace.
+	 */
+	if (is_file_epoll(tf.file) && ep_polled_by_user(tf.file->private_data))
+		goto error_tgt_fput;
+
+	/*
+	 * At this point it is safe to assume that the "private_data" contains
+	 * our own data structure.
+	 */
+	ep = f.file->private_data;
+
 	/*
 	 * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
 	 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
-	 * Also, we do not currently supported nested exclusive wakeups.
+	 * Also, we do not currently supported nested exclusive wakeups
+	 * and EPOLLEXCLUSIVE is not supported for epoll which is polled
+	 * from userspace.
 	 */
 	if (ep_op_has_event(op) && (epds.events & EPOLLEXCLUSIVE)) {
 		if (op == EPOLL_CTL_MOD)
@@ -2348,13 +2371,18 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
 				(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
 			goto error_tgt_fput;
+		if (ep_polled_by_user(ep))
+			goto error_tgt_fput;
 	}
 
-	/*
-	 * At this point it is safe to assume that the "private_data" contains
-	 * our own data structure.
-	 */
-	ep = f.file->private_data;
+	if (ep_op_has_event(op)) {
+		if (ep_polled_by_user(ep) && !(epds.events & EPOLLET))
+			/* Polled by user has only edge triggered behaviour */
+			goto error_tgt_fput;
+
+		/* Check if EPOLLWAKEUP is allowed */
+		ep_take_care_of_epollwakeup(ep, &epds);
+	}
 
 	/*
 	 * When we insert an epoll file descriptor, inside another epoll file
@@ -2456,14 +2484,6 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events,
 	struct fd f;
 	struct eventpoll *ep;
 
-	/* The maximum number of event must be greater than zero */
-	if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
-		return -EINVAL;
-
-	/* Verify that the area passed by the user is writeable */
-	if (!access_ok(events, maxevents * sizeof(struct epoll_event)))
-		return -EFAULT;
-
 	/* Get the "struct file *" for the eventpoll file */
 	f = fdget(epfd);
 	if (!f.file)
@@ -2482,6 +2502,20 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events,
 	 * our own data structure.
 	 */
 	ep = f.file->private_data;
+	if (!ep_polled_by_user(ep)) {
+		/* The maximum number of event must be greater than zero */
+		if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
+			goto error_fput;
+
+		/* Verify that the area passed by the user is writeable */
+		error = -EFAULT;
+		if (!access_ok(events, maxevents * sizeof(struct epoll_event)))
+			goto error_fput;
+	} else {
+		/* Use ring instead */
+		if (maxevents != 0 || events != NULL)
+			goto error_fput;
+	}
 
 	/* Time to fish for events ... */
 	error = ep_poll(ep, events, maxevents, timeout);
-- 
2.19.1



* [RFC PATCH 09/15] epoll: introduce stand-alone helpers for polling from userspace
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (7 preceding siblings ...)
  2019-01-09 16:40 ` [RFC PATCH 08/15] epoll: some sanity flags checks for epoll syscalls for polled epfd " Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 17:29   ` Linus Torvalds
  2019-01-09 16:40 ` [RFC PATCH 10/15] epoll: support polling from userspace for ep_insert() Roman Penyaev
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Davidlohr Bueso, Jason Baron,
	Al Viro, Paul E. McKenney, Linus Torvalds, Andrea Parri,
	linux-fsdevel, linux-kernel

ep_vrealloc*()
    reallocates user header, user index or bitmap memory

ep_get_bit()
    gets a free bit from the bitmap; if no free bit is found, the bitmap
    is expanded by PAGE_SIZE

ep_expand_user_is_required()
    helper which returns true if expansion of any of the memory chunks
    is required

ep_shrink_user_is_required()
    helper which returns true if shrinking of any of the memory chunks
    is required; the ep_shrunk_*_length() helpers return the new size

ep_expand_user_*()
    expands user header or user index

ep_shrink_user_*()
    shrinks user header, user index or bitmaps.  Shrinking involves an
    important step: sparse bits at the end of the bitmap are moved to
    the beginning, in order to free pages at the end.

ep_route_events_to_*()
    routes events to klists or to uring.  Should be called under the
    write lock, when all events are stopped.

ep_free_user_item()
    marks the item inside the user header as freed, i.e. atomically
    exchanges ready_events with 0.  Also puts the item bit, or postpones
    the put until the user enters the kernel.

ep_add_event_to_uring()
    adds a new event to the user ring.  First marks the user item as
    ready and, if the item was observed as not ready, fills in the user
    index.

ep_transfer_events_and_shrink_uring()
    shrinks if needed and transfers events from klists to uring under
    the write lock.
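
To illustrate the idea behind ep_add_event_to_uring(), here is a
stand-alone sketch using C11 atomics instead of the kernel primitives.
The sk_ types and names are hypothetical and the memory orderings are a
guess at something reasonable for userspace, not a statement about the
kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the user item and the user ring */
struct sk_uitem {
	_Atomic uint32_t ready_events;
};

struct sk_uring {
	_Atomic uint32_t tail;   /* next slot to be filled by the producer */
	uint32_t nr;             /* number of slots, a power of two */
	_Atomic uint32_t *index; /* ring of item positions + 1 */
	struct sk_uitem *items;
};

/* Returns true if a new ring slot was filled for the item */
static inline bool sk_add_event(struct sk_uring *r, uint32_t bit,
				uint32_t pollflags)
{
	uint32_t old, i;

	if (!pollflags)
		return false;

	/* Merge the new flags; a non-zero old value means already queued */
	old = atomic_fetch_or_explicit(&r->items[bit].ready_events,
				       pollflags, memory_order_acquire);
	if (old)
		return false;

	/* Reserve a slot, then publish the item position + 1 (never 0) */
	i = atomic_fetch_add_explicit(&r->tail, 1, memory_order_acquire);
	atomic_store_explicit(&r->index[i & (r->nr - 1)], bit + 1,
			      memory_order_release);
	return true;
}
```

An item that is already marked ready only accumulates flags, so it
occupies at most one ring slot until the consumer clears ready_events.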

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 420 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 420 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index bdaec59a847e..36c451c26681 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -929,6 +929,238 @@ static void epi_rcu_free(struct rcu_head *head)
 	kmem_cache_free(epi_cache, epi);
 }
 
+static int ep_vrealloc(void **pptr, unsigned int size)
+{
+	void *old = *pptr, *new;
+
+	new = vrealloc(old, size);
+	if (unlikely(!new))
+		return -ENOMEM;
+	if (unlikely(new == old))
+		return 0;
+
+	*pptr = new;
+	vfree(old);
+
+	return 0;
+}
+
+static int ep_vrealloc_bm(struct eventpoll *ep, unsigned int bm_len)
+{
+	unsigned long *bm, *removed_bm;
+
+	/* Reallocate all at once */
+	bm = vrealloc(ep->items_bm, bm_len);
+	removed_bm = vrealloc(ep->removed_items_bm, bm_len);
+
+	if (unlikely(!bm || !removed_bm)) {
+		vfree(bm);
+		vfree(removed_bm);
+
+		return -ENOMEM;
+	}
+	ep->items_bm = bm;
+	ep->removed_items_bm = removed_bm;
+	ep->items_bm_length = bm_len;
+
+	return 0;
+}
+
+static int ep_get_bit(struct eventpoll *ep)
+{
+	unsigned int max_nr;
+	int bit, start_bit;
+	bool was_set;
+
+	lockdep_assert_held(&ep->mtx);
+
+	start_bit = 0;
+again:
+	max_nr = ep_max_items_bm_nr(ep);
+	bit = find_next_zero_bit(ep->items_bm, max_nr, start_bit);
+	if (bit >= max_nr) {
+		unsigned int bm_len;
+		int rc;
+
+		start_bit = max_nr;
+		bm_len = ep->items_bm_length + PAGE_SIZE;
+
+		rc = ep_vrealloc_bm(ep, bm_len);
+		if (unlikely(rc))
+			return rc;
+
+		goto again;
+	}
+
+	was_set = test_and_set_bit(bit, ep->items_bm);
+	WARN_ON(was_set);
+
+	return bit;
+}
+
+static inline bool ep_expand_user_items_is_required(struct eventpoll *ep)
+{
+	return (ep->items_nr >= ep_max_items_nr(ep));
+}
+
+static inline bool ep_expand_user_index_is_required(struct eventpoll *ep)
+{
+	return (ep->items_nr + EPOLL_USER_EXTRA_INDEX_NR)
+		>= ep_max_index_nr(ep);
+}
+
+static inline bool ep_expand_user_is_required(struct eventpoll *ep)
+{
+	return ep_expand_user_items_is_required(ep) ||
+		ep_expand_user_index_is_required(ep);
+}
+
+static inline unsigned int ep_shrunk_user_index_length(struct eventpoll *ep)
+{
+	unsigned int len, nr;
+
+	nr = ep->items_nr + EPOLL_USER_EXTRA_INDEX_NR;
+	len = PAGE_ALIGN(to_index_length(nr) + (PAGE_SIZE >> 1));
+	if (len < ep->index_length)
+		return len;
+
+	return 0;
+}
+
+static inline unsigned int ep_shrunk_user_items_length(struct eventpoll *ep)
+{
+	unsigned int len;
+
+	len = PAGE_ALIGN(to_items_length(ep->items_nr) + (PAGE_SIZE >> 1));
+	if (len < ep->header_length)
+		return len;
+
+	return 0;
+}
+
+static inline unsigned int ep_shrunk_items_bm_length(struct eventpoll *ep)
+{
+	unsigned int len;
+
+	len = PAGE_ALIGN(to_items_bm_length(ep->items_nr) + (PAGE_SIZE >> 1));
+	if (len < ep->items_bm_length)
+		return len;
+
+	return 0;
+}
+
+static inline bool ep_shrink_user_is_required(struct eventpoll *ep)
+{
+	return ep_shrunk_user_items_length(ep) != 0 ||
+		ep_shrunk_user_index_length(ep) != 0 ||
+		ep_shrunk_items_bm_length(ep) != 0;
+}
+
+static inline void ep_route_events_to_klists(struct eventpoll *ep)
+{
+	WARN_ON(!ep_polled_by_user(ep));
+	ep->events_to_uring = false;
+	ep->user_header->state = EPOLL_USER_POLL_INACTIVE;
+	/* Make sure userspace sees INACTIVE state ASAP */
+	smp_wmb();
+}
+
+static inline void ep_route_events_to_uring(struct eventpoll *ep)
+{
+	WARN_ON(!ep_polled_by_user(ep));
+	ep->events_to_uring = true;
+	/* Commit all previous writes to user header */
+	smp_wmb();
+	ep->user_header->state = EPOLL_USER_POLL_ACTIVE;
+}
+
+static inline bool ep_events_routed_to_klists(struct eventpoll *ep)
+{
+	return !ep->events_to_uring;
+}
+
+static inline bool ep_events_routed_to_uring(struct eventpoll *ep)
+{
+	return ep->events_to_uring;
+}
+
+static inline bool ep_free_user_item(struct epitem *epi)
+{
+	struct eventpoll *ep = epi->ep;
+	struct user_epitem *uitem;
+
+	bool events_to_klist = false;
+
+	lockdep_assert_held(&ep->mtx);
+
+	ep->items_nr--;
+
+	uitem = &ep->user_header->items[epi->bit];
+
+	/* Firstly drop item events passed from userland */
+	memset(&uitem->event, 0, sizeof(uitem->event));
+
+	/*
+	 * If the event has not been signaled or has already been consumed
+	 * by userspace, it is safe to reuse the bit immediately, i.e. just
+	 * put it.  If userspace has not yet consumed this event, we set
+	 * the bit in the removed bitmap in order to put it later.
+	 */
+	if (xchg(&uitem->ready_events, 0)) {
+		set_bit(epi->bit, ep->removed_items_bm);
+		events_to_klist = true;
+	} else {
+		/*
+		 * Should not be reordered with memset above, thus unlock
+		 * semantics.
+		 */
+		clear_bit_unlock(epi->bit, ep->items_bm);
+		events_to_klist = ep_shrink_user_is_required(ep);
+	}
+
+	return events_to_klist;
+}
+
+static bool ep_add_event_to_uring(struct epitem *epi, __poll_t pollflags)
+{
+	struct eventpoll *ep = epi->ep;
+	struct user_epitem *uitem;
+	bool added = false;
+
+	if (WARN_ON(!pollflags))
+		return false;
+
+	uitem = &ep->user_header->items[epi->bit];
+	if (!__atomic_fetch_or(&uitem->ready_events, pollflags,
+			       __ATOMIC_ACQUIRE)) {
+		unsigned int i, *item_idx, index_mask;
+
+		/*
+		 * Item was not ready before, thus we have to insert
+		 * new index to the ring.
+		 */
+
+		index_mask = ep_max_index_nr(ep) - 1;
+		i = __atomic_fetch_add(&ep->user_header->tail, 1,
+				       __ATOMIC_ACQUIRE);
+		item_idx = &ep->user_index[i & index_mask];
+
+		/* Signal with a bit, which is > 0 */
+		*item_idx = epi->bit + 1;
+
+		/*
+		 * Want index update be flushed from CPU write buffer and
+		 * immediately visible on userspace side to avoid long busy
+		 * loops.
+		 */
+		smp_wmb();
+
+		added = true;
+	}
+
+	return added;
+}
+
 /*
  * Removes a "struct epitem" from the eventpoll RB tree and deallocates
  * all the associated resources. Must be called with "mtx" held.
@@ -1695,6 +1927,44 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi)
 	wakeup_source_unregister(ws);
 }
 
+static int ep_expand_user_items(struct eventpoll *ep)
+{
+	unsigned int len;
+	int rc;
+
+	if (!ep_expand_user_items_is_required(ep))
+		/* Expanding is not needed */
+		return 0;
+
+	len = ep->header_length + PAGE_SIZE;
+	rc = ep_vrealloc((void **)&ep->user_header, len);
+	if (unlikely(rc))
+		return rc;
+
+	ep->header_length = len;
+
+	return 0;
+}
+
+static int ep_expand_user_index(struct eventpoll *ep)
+{
+	unsigned int len;
+	int rc;
+
+	if (!ep_expand_user_index_is_required(ep))
+		/* Expanding is not needed */
+		return 0;
+
+	len = ep->index_length + PAGE_SIZE;
+	rc = ep_vrealloc((void **)&ep->user_index, len);
+	if (unlikely(rc))
+		return rc;
+
+	ep->index_length = len;
+
+	return 0;
+}
+
 /*
  * Must be called with "mtx" held.
  */
@@ -2010,6 +2280,156 @@ static inline struct timespec64 ep_set_mstimeout(long ms)
 	return timespec64_add_safe(now, ts);
 }
 
+static int ep_shrink_user_index(struct eventpoll *ep)
+{
+	unsigned int len;
+	int rc;
+
+	len = ep_shrunk_user_index_length(ep);
+	if (!len)
+		/* Shrinking is not needed */
+		return 0;
+
+	rc = ep_vrealloc((void **)&ep->user_index, len);
+	if (unlikely(rc))
+		return rc;
+
+	ep->index_length = len;
+
+	return 0;
+}
+
+static int ep_shrink_user_items_and_bm(struct eventpoll *ep)
+{
+	unsigned int header_len, bm_len;
+	unsigned int bit, last_bit = UINT_MAX;
+	int rc;
+
+	struct rb_node *rbp;
+	struct epitem *epi;
+
+	lockdep_assert_held(&ep->mtx);
+
+	header_len = ep_shrunk_user_items_length(ep);
+	bm_len = ep_shrunk_items_bm_length(ep);
+	if (!header_len && !bm_len)
+		/* Shrinking is not needed */
+		return 0;
+
+	/*
+	 * Find left most last bit
+	 */
+	if (header_len)
+		last_bit = to_items_nr(header_len);
+	if (bm_len)
+		last_bit = min(last_bit, to_items_bm_nr(bm_len));
+
+	if (WARN_ON(last_bit <= ep->items_nr))
+		return -EINVAL;
+
+	/*
+	 * Find bits from the right and move them to the left in order to
+	 * free space on the right.
+	 *
+	 * This is not nice, because O(n), but frankly this operation should
+	 * be quite rare.  If not - let's switch to idr or something similar
+	 * (but that obviously will consume more memory).
+	 *
+	 */
+	bit = 0;
+	for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = rb_next(rbp)) {
+		epi = rb_entry(rbp, struct epitem, rbn);
+
+		if (epi->bit >= last_bit) {
+			/* Find first available bit from left */
+			bit = find_next_zero_bit(ep->items_bm, last_bit, bit);
+			if (WARN_ON(bit >= last_bit))
+				return -EINVAL;
+
+			/* Clear old bit from right */
+			clear_bit(epi->bit, ep->items_bm);
+
+			/*
+			 * Set item bit and advance an iterator for the
+			 * following find_next_zero_bit() call.
+			 */
+			epi->bit = bit++;
+		}
+	}
+
+	/*
+	 * Reallocate memory and commit lengths
+	 */
+	if (header_len) {
+		rc = ep_vrealloc((void **)&ep->user_header, header_len);
+		if (unlikely(rc))
+			return rc;
+
+		ep->header_length = header_len;
+	}
+	if (bm_len) {
+		rc = ep_vrealloc_bm(ep, bm_len);
+		if (unlikely(rc))
+			return rc;
+	}
+
+	return 0;
+}
+
+static inline void ep_put_postponed_user_items_bits(struct eventpoll *ep)
+{
+	size_t sz, i;
+
+	lockdep_assert_held(&ep->mtx);
+
+	sz = ep->items_bm_length >> ilog2(sizeof(ep->items_bm[0]));
+	for (i = 0; i < sz; i++) {
+		ep->items_bm[i] &= ~(ep->removed_items_bm[i]);
+		ep->removed_items_bm[i] = 0ul;
+	}
+}
+
+static int ep_transfer_events_and_shrink_uring(struct eventpoll *ep)
+{
+	struct epitem *epi, *tmp;
+	int rc = 0;
+
+	mutex_lock(&ep->mtx);
+	if (ep_events_routed_to_uring(ep))
+		/* A bit late */
+		goto unlock;
+
+	/* Here at this point we are sure uring is empty */
+	ep_put_postponed_user_items_bits(ep);
+
+	rc = ep_shrink_user_index(ep);
+	if (unlikely(rc))
+		goto unlock;
+
+	rc = ep_shrink_user_items_and_bm(ep);
+	if (unlikely(rc))
+		goto unlock;
+
+	/* Commit lengths to userspace, but state is not yet ACTIVE */
+	ep->user_header->index_length = ep->index_length;
+	ep->user_header->header_length = ep->header_length;
+
+	write_lock_irq(&ep->lock);
+	/* Atomically transfer events from klists to uring */
+	list_for_each_entry_safe(epi, tmp, &ep->rdllist, rdllink) {
+		ep_add_event_to_uring(epi, epi->ready_events);
+		list_del_init(&epi->rdllink);
+		epi->ready_events = 0;
+	}
+	ep_route_events_to_uring(ep);
+	write_unlock_irq(&ep->lock);
+
+unlock:
+	mutex_unlock(&ep->mtx);
+
+	return rc;
+}
+
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller supplied
  *           event buffer.
-- 
2.19.1



* [RFC PATCH 10/15] epoll: support polling from userspace for ep_insert()
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (8 preceding siblings ...)
  2019-01-09 16:40 ` [RFC PATCH 09/15] epoll: introduce stand-alone helpers for polling " Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 11/15] epoll: offload polling to a work in case of epfd polled from userspace Roman Penyaev
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Davidlohr Bueso, Jason Baron,
	Al Viro, Paul E. McKenney, Linus Torvalds, Andrea Parri,
	linux-fsdevel, linux-kernel

When epfd is polled by userspace and a new item is inserted:

1. Get a free bit for the new item.
2. If the user items or the user index have to be expanded, route all
   events to kernel lists and do the expand.
3. If events are ready for the newly inserted item, add the event to the
   uring; if events have just been routed to klists, add the item to
   rdllist instead.
4. On the error path mark the user item as freed and route events to
   klists if the ready event has not yet been observed by userspace.
   That is needed to postpone the bit put, otherwise a newly allocated
   bit would corrupt the user item.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 74 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 65 insertions(+), 9 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 36c451c26681..4618db9c077c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1977,6 +1977,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	struct epitem *epi;
 	struct ep_pqueue epq;
 
+	lockdep_assert_held(&ep->mtx);
 	lockdep_assert_irqs_enabled();
 
 	user_watches = atomic_long_read(&ep->user->epoll_watches);
@@ -2002,6 +2003,43 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 		RCU_INIT_POINTER(epi->ws, NULL);
 	}
 
+	if (ep_polled_by_user(ep)) {
+		struct user_epitem *uitem;
+		int bit;
+
+		bit = ep_get_bit(ep);
+		if (unlikely(bit < 0)) {
+			error = bit;
+			goto error_get_bit;
+		}
+		epi->bit = bit;
+		ep->items_nr++;
+
+		if (ep_expand_user_is_required(ep)) {
+			/*
+			 * Expand of user header or user index is required,
+			 * thus reroute all events to klists and then safely
+			 * vrealloc() the memory.
+			 */
+			write_lock_irq(&ep->lock);
+			ep_route_events_to_klists(ep);
+			write_unlock_irq(&ep->lock);
+
+			error = ep_expand_user_items(ep);
+			if (unlikely(error))
+				goto error_expand;
+
+			error = ep_expand_user_index(ep);
+			if (unlikely(error))
+				goto error_expand;
+		}
+
+		/* Now fill-in user item */
+		uitem = &ep->user_header->items[epi->bit];
+		uitem->ready_events = 0;
+		uitem->event = *event;
+	}
+
 	/* Initialize the poll table using the queue callback */
 	epq.epi = epi;
 	init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
@@ -2046,16 +2084,23 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	/* record NAPI ID of new item if present */
 	ep_set_busy_poll_napi_id(epi);
 
-	/* If the file is already "ready" we drop it inside the ready list */
-	if (revents && !ep_is_linked(epi)) {
-		list_add_tail(&epi->rdllink, &ep->rdllist);
-		ep_pm_stay_awake(epi);
+	if (revents) {
+		bool added = false;
 
-		/* Notify waiting tasks that events are available */
-		if (waitqueue_active(&ep->wq))
-			wake_up(&ep->wq);
-		if (waitqueue_active(&ep->poll_wait))
-			pwake++;
+		if (ep_events_routed_to_uring(ep))
+			added = ep_add_event_to_uring(epi, revents);
+		else if (!ep_is_linked(epi)) {
+			list_add_tail(&epi->rdllink, &ep->rdllist);
+			ep_pm_stay_awake(epi);
+			added = true;
+		}
+		if (added) {
+			/* Notify waiting tasks that events are available */
+			if (waitqueue_active(&ep->wq))
+				wake_up(&ep->wq);
+			if (waitqueue_active(&ep->poll_wait))
+				pwake++;
+		}
 	}
 
 	write_unlock_irq(&ep->lock);
@@ -2089,6 +2134,17 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 		list_del_init(&epi->rdllink);
 	write_unlock_irq(&ep->lock);
 
+	if (ep_polled_by_user(ep)) {
+error_expand:
+		/*
+		 * No need to check return value: if events are routed to
+		 * klists, that is done by code above, where we've expanded
+		 * memory, but here, on rollback, we do not care.
+		 */
+		(void)ep_free_user_item(epi);
+	}
+
+error_get_bit:
 	wakeup_source_unregister(ep_wakeup_source(epi));
 
 error_create_wakeup_source:
-- 
2.19.1



* [RFC PATCH 11/15] epoll: offload polling to a work in case of epfd polled from userspace
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (9 preceding siblings ...)
  2019-01-09 16:40 ` [RFC PATCH 10/15] epoll: support polling from userspace for ep_insert() Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 12/15] epoll: support polling from userspace for ep_remove() Roman Penyaev
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Davidlohr Bueso, Jason Baron,
	Al Viro, Paul E. McKenney, Linus Torvalds, Andrea Parri,
	linux-fsdevel, linux-kernel

Not every device reports pollflags on wake_up(), expecting that it will be
polled later.  vfs_poll() can't be called from ep_poll_callback(), because
ep_poll_callback() is called under the spinlock.  Obviously userspace can't
call vfs_poll(), thus epoll has to offload vfs_poll() to a work item and then
call ep_poll_callback() with pollflags in hand.
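
The dispatch described above can be sketched as a small userspace model.
This is only an illustration under stated assumptions, not the kernel code:
struct item, wakeup(), run_work() and deliver() are hypothetical stand-ins
for the epitem, ep_poll_wakeup(), the work function and ep_poll_callback(),
and a plain array replaces the kernel workqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_MAX 16

/* Hypothetical stand-in for an epitem. */
struct item {
	unsigned int device_state; /* what a later poll of the device reports */
	unsigned int delivered;    /* flags accumulated for the consumer */
	bool queued;               /* work already pending for this item */
};

/* Hypothetical stand-in for the kernel workqueue. */
struct item *work_queue[QUEUE_MAX];
size_t work_count;

/* Stand-in for ep_poll_callback(): record the flags for the consumer. */
void deliver(struct item *it, unsigned int pollflags)
{
	it->delivered |= pollflags;
}

/*
 * Stand-in for the wakeup path, which runs where polling is not allowed.
 * With flags in hand the event is delivered directly; without flags the
 * poll is deferred to a work item.
 */
void wakeup(struct item *it, unsigned int pollflags)
{
	if (pollflags) {
		deliver(it, pollflags);
		return;
	}
	if (!it->queued) {
		assert(work_count < QUEUE_MAX);
		it->queued = true;
		work_queue[work_count++] = it;
	}
}

/* Stand-in for the worker: poll each queued item, then deliver the flags. */
void run_work(void)
{
	for (size_t i = 0; i < work_count; i++) {
		struct item *it = work_queue[i];

		it->queued = false;
		deliver(it, it->device_state);
	}
	work_count = 0;
}
```

In this model a wakeup without flags only queues the item once; the flags
become visible to the consumer after run_work() has polled it.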

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 111 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 87 insertions(+), 24 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4618db9c077c..2af849e6c7a5 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1624,9 +1624,8 @@ static inline bool chain_epi_lockless(struct epitem *epi)
 }
 
 /*
- * This is the callback that is passed to the wait queue wakeup
- * mechanism. It is called by the stored file descriptors when they
- * have events to report.
+ * This is the callback that is called directly from wait queue wakeup or
+ * from a work.
  *
  * This callback takes a read lock in order not to content with concurrent
  * events from another file descriptors, thus all modifications to ->rdllist
@@ -1641,14 +1640,11 @@ static inline bool chain_epi_lockless(struct epitem *epi)
  * queues are used should be detected accordingly.  This is detected using
  * cmpxchg() operation.
  */
-static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
+static int ep_poll_callback(struct epitem *epi, __poll_t pollflags)
 {
-	int pwake = 0;
-	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
-	__poll_t pollflags = key_to_poll(key);
+	int pwake = 0, ewake = 0;
 	unsigned long flags;
-	int ewake = 0;
 
 	read_lock_irqsave(&ep->lock, flags);
 
@@ -1666,12 +1662,32 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	/*
 	 * Check the events coming with the callback. At this stage, not
 	 * every device reports the events in the "key" parameter of the
-	 * callback. We need to be able to handle both cases here, hence the
-	 * test for "key" != NULL before the event match test.
+	 * callback (if epfd is polled by user, a work item polls the file).
+	 * We need to be able to handle both cases here, hence the test
+	 * for "key" != NULL before the event match test.
 	 */
 	if (pollflags && !(pollflags & epi->event.events))
 		goto out_unlock;
 
+	if (ep_polled_by_user(ep)) {
+		__poll_t revents;
+
+		if (ep_events_routed_to_uring(ep)) {
+			ep_add_event_to_uring(epi, pollflags);
+			goto wakeup;
+		}
+
+		WARN_ON(!pollflags);
+		revents = (epi->event.events & ~EP_PRIVATE_BITS) & pollflags;
+
+		/*
+		 * Keep active events up-to-date for further transfer from
+		 * klists to uring.
+		 */
+		__atomic_fetch_or(&epi->ready_events, revents,
+				  __ATOMIC_RELAXED);
+	}
+
 	/*
 	 * If we are transferring events to userspace, we can hold no locks
 	 * (because we're accessing user memory, and because of linux f_op->poll()
@@ -1679,6 +1695,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * chained in ep->ovflist and requeued later on.
 	 */
 	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
+		WARN_ON(ep_polled_by_user(ep));
 		if (epi->next == EP_UNACTIVE_PTR &&
 		    chain_epi_lockless(epi))
 			ep_pm_stay_awake_rcu(epi);
@@ -1691,6 +1708,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 		ep_pm_stay_awake_rcu(epi);
 	}
 
+wakeup:
 	/*
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
@@ -1727,23 +1745,67 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	if (!(epi->event.events & EPOLLEXCLUSIVE))
 		ewake = 1;
 
-	if (pollflags & POLLFREE) {
-		/*
-		 * If we race with ep_remove_wait_queue() it can miss
-		 * ->whead = NULL and do another remove_wait_queue() after
-		 * us, so we can't use __remove_wait_queue().
-		 */
-		list_del_init(&wait->entry);
+	return ewake;
+}
+
+static void ep_poll_callback_work(struct work_struct *work)
+{
+	struct epitem *epi = container_of(work, typeof(*epi), work);
+	__poll_t pollflags;
+	poll_table pt;
+
+	WARN_ON(!ep_polled_by_user(epi->ep));
+
+	init_poll_funcptr(&pt, NULL);
+	pollflags = ep_item_poll(epi, &pt, 1);
+
+	(void)ep_poll_callback(epi, pollflags);
+}
+
+/*
+ * This is the callback that is passed to the wait queue wakeup
+ * mechanism. It is called by the stored file descriptors when they
+ * have events to report.
+ */
+static int ep_poll_wakeup(wait_queue_entry_t *wait, unsigned int mode,
+			  int sync, void *key)
+{
+
+	struct epitem *epi = ep_item_from_wait(wait);
+	struct eventpoll *ep = epi->ep;
+	__poll_t pollflags = key_to_poll(key);
+	int rc;
+
+	if (!ep_polled_by_user(ep) || pollflags) {
+		rc = ep_poll_callback(epi, pollflags);
+
+		if (pollflags & POLLFREE) {
+			/*
+			 * If we race with ep_remove_wait_queue() it can miss
+			 * ->whead = NULL and do another remove_wait_queue()
+			 * after us, so we can't use __remove_wait_queue().
+			 */
+			list_del_init(&wait->entry);
+			/*
+			 * ->whead != NULL protects us from the race with
+			 * ep_free() or ep_remove(), ep_remove_wait_queue()
+			 * takes whead->lock held by the caller. Once we nullify
+			 * it, nothing protects ep/epi or even wait.
+			 */
+			smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
+		}
+	} else {
+		schedule_work(&epi->work);
+
 		/*
-		 * ->whead != NULL protects us from the race with ep_free()
-		 * or ep_remove(), ep_remove_wait_queue() takes whead->lock
-		 * held by the caller. Once we nullify it, nothing protects
-		 * ep/epi or even wait.
+		 * Here on this path we are absolutely sure that for file
+		 * descriptors which are pollable from userspace we do not
+		 * support EPOLLEXCLUSIVE, so it is safe to return 1.
 		 */
-		smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
+		rc = 1;
 	}
 
-	return ewake;
+	return rc;
 }
 
 /*
@@ -1757,7 +1819,7 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 	struct eppoll_entry *pwq;
 
 	if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
-		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
+		init_waitqueue_func_entry(&pwq->wait, ep_poll_wakeup);
 		pwq->whead = whead;
 		pwq->base = epi;
 		if (epi->event.events & EPOLLEXCLUSIVE)
@@ -1990,6 +2052,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	INIT_LIST_HEAD(&epi->rdllink);
 	INIT_LIST_HEAD(&epi->fllink);
 	INIT_LIST_HEAD(&epi->pwqlist);
+	INIT_WORK(&epi->work, ep_poll_callback_work);
 	epi->ep = ep;
 	ep_set_ffd(&epi->ffd, tfile, fd);
 	epi->event = *event;
-- 
2.19.1



* [RFC PATCH 12/15] epoll: support polling from userspace for ep_remove()
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (10 preceding siblings ...)
  2019-01-09 16:40 ` [RFC PATCH 11/15] epoll: offload polling to a work in case of epfd polled from userspace Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 13/15] epoll: support polling from userspace for ep_modify() Roman Penyaev
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Davidlohr Bueso, Jason Baron,
	Al Viro, Paul E. McKenney, Linus Torvalds, Andrea Parri,
	linux-fsdevel, linux-kernel

When epfd is polled from userspace and item is being removed:

1. Mark the user item as freed.  If userspace has not yet consumed the
   ready event, route all events to kernel lists.
2. If a shrink is required, route all events to kernel lists.
3. On unregistration of epoll entries do not forget to flush the item
   work, which may have just been scheduled from ep_poll_wakeup().

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2af849e6c7a5..7732a8029a1c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -780,6 +780,14 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
 		ep_remove_wait_queue(pwq);
 		kmem_cache_free(pwq_cache, pwq);
 	}
+	if (ep_polled_by_user(ep)) {
+		/*
+		 * Events polled by user require offloading to a work,
+		 * thus we have to be sure everything which was queued
+		 * has run to completion.
+		 */
+		flush_work(&epi->work);
+	}
 }
 
 /* call only when ep->mtx is held */
@@ -1168,6 +1176,7 @@ static bool ep_add_event_to_uring(struct epitem *epi, __poll_t pollflags)
 static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 {
 	struct file *file = epi->ffd.file;
+	bool events_to_klists = false;
 
 	lockdep_assert_irqs_enabled();
 
@@ -1183,9 +1192,14 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 
 	rb_erase_cached(&epi->rbn, &ep->rbr);
 
+	if (ep_polled_by_user(ep))
+		events_to_klists = ep_free_user_item(epi);
+
 	write_lock_irq(&ep->lock);
 	if (ep_is_linked(epi))
 		list_del_init(&epi->rdllink);
+	if (events_to_klists)
+		ep_route_events_to_klists(ep);
 	write_unlock_irq(&ep->lock);
 
 	wakeup_source_unregister(ep_wakeup_source(epi));
-- 
2.19.1



* [RFC PATCH 13/15] epoll: support polling from userspace for ep_modify()
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (11 preceding siblings ...)
  2019-01-09 16:40 ` [RFC PATCH 12/15] epoll: support polling from userspace for ep_remove() Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 14/15] epoll: support polling from userspace for ep_poll() Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 15/15] epoll: support mapping for epfd when polled from userspace Roman Penyaev
  14 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Davidlohr Bueso, Jason Baron,
	Al Viro, Paul E. McKenney, Linus Torvalds, Andrea Parri,
	linux-fsdevel, linux-kernel

When epfd is polled from userspace and item is being modified:

1. Update user item with new pointer or poll flags.
2. Add event to user ring if needed.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 7732a8029a1c..2b38a3d884e8 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2237,6 +2237,8 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 		     const struct epoll_event *event)
 {
+	struct user_epitem *uitem;
+	__poll_t revents;
 	int pwake = 0;
 	poll_table pt;
 
@@ -2251,6 +2253,13 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 	 */
 	epi->event.events = event->events; /* need barrier below */
 	epi->event.data = event->data; /* protected by mtx */
+
+	/* Update user item, barrier is below */
+	if (ep_polled_by_user(ep)) {
+		uitem = &ep->user_header->items[epi->bit];
+		uitem->event = *event;
+	}
+
 	if (epi->event.events & EPOLLWAKEUP) {
 		if (!ep_has_wakeup_source(epi))
 			ep_create_wakeup_source(epi);
@@ -2284,12 +2293,19 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 	 * If the item is "hot" and it is not registered inside the ready
 	 * list, push it inside.
 	 */
-	if (ep_item_poll(epi, &pt, 1)) {
+	revents = ep_item_poll(epi, &pt, 1);
+	if (revents) {
+		bool added = false;
+
 		write_lock_irq(&ep->lock);
-		if (!ep_is_linked(epi)) {
+		if (ep_events_routed_to_uring(ep))
+			added = ep_add_event_to_uring(epi, revents);
+		else if (!ep_is_linked(epi)) {
 			list_add_tail(&epi->rdllink, &ep->rdllist);
 			ep_pm_stay_awake(epi);
-
+			added = true;
+		}
+		if (added) {
 			/* Notify waiting tasks that events are available */
 			if (waitqueue_active(&ep->wq))
 				wake_up(&ep->wq);
-- 
2.19.1



* [RFC PATCH 14/15] epoll: support polling from userspace for ep_poll()
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (12 preceding siblings ...)
  2019-01-09 16:40 ` [RFC PATCH 13/15] epoll: support polling from userspace for ep_modify() Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  2019-01-09 16:40 ` [RFC PATCH 15/15] epoll: support mapping for epfd when polled from userspace Roman Penyaev
  14 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Davidlohr Bueso, Jason Baron,
	Al Viro, Paul E. McKenney, Linus Torvalds, Andrea Parri,
	linux-fsdevel, linux-kernel

When epfd is polled from userspace and user calls epoll_wait():

1. If the user ring is not fully consumed (i.e. head != tail), return
   -ESTALE, indicating that some action on the user side is required.

2. If events were routed to klists, memory was probably expanded or a
   shrink is still required.  Do the shrink if needed and transfer all
   collected events from kernel lists to the uring.

3. Ensure with WARN that ep_send_events() can't be called from
   ep_poll() when epfd is pollable from userspace.

4. Wait for events on the wait queue, always returning -ESTALE if
   awakened, indicating that events have to be consumed from the user
   ring.
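
The entry checks in steps 1 and 2 can be modelled roughly as below.  This
is a userspace sketch that follows the description in this message, not the
patch itself: struct ring_state and userpoll_wait_check() are hypothetical
names, and the transfer from kernel lists is reduced to advancing the tail.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the state ep_poll() looks at. */
struct ring_state {
	unsigned int head;         /* consumer position, advanced by the user */
	unsigned int tail;         /* producer position, advanced by the kernel */
	bool routed_to_klists;     /* events currently go to kernel lists */
	unsigned int klist_events; /* events collected on kernel lists */
};

/*
 * Returns -ESTALE when the user has to consume the ring first, or 0 when
 * the caller would go on to sleep on the wait queue (any later wakeup
 * then also ends in -ESTALE, as in step 4).
 */
int userpoll_wait_check(struct ring_state *s)
{
	assert(s != NULL);

	/* Step 1: ring not fully consumed. */
	if (s->head != s->tail)
		return -ESTALE;

	/* Step 2: move events collected on kernel lists to the ring. */
	if (s->routed_to_klists) {
		s->tail += s->klist_events;
		s->klist_events = 0;
		s->routed_to_klists = false;
		if (s->head != s->tail)
			return -ESTALE;
	}

	return 0;
}
```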

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 46 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 37 insertions(+), 9 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2b38a3d884e8..5de640fcf28b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -523,7 +523,8 @@ static inline bool ep_user_ring_events_available(struct eventpoll *ep)
 static inline int ep_events_available(struct eventpoll *ep)
 {
 	return !list_empty_careful(&ep->rdllist) ||
-		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
+		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR ||
+		ep_user_ring_events_available(ep);
 }
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
@@ -2411,6 +2412,8 @@ static int ep_send_events(struct eventpoll *ep,
 {
 	struct ep_send_events_data esed;
 
+	WARN_ON(ep_polled_by_user(ep));
+
 	esed.maxevents = maxevents;
 	esed.events = events;
 
@@ -2607,6 +2610,24 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 
 	lockdep_assert_irqs_enabled();
 
+	if (ep_polled_by_user(ep)) {
+		if (ep_user_ring_events_available(ep))
+			/* Firstly all events from ring have to be consumed */
+			return -ESTALE;
+
+		if (ep_events_routed_to_klists(ep)) {
+			res = ep_transfer_events_and_shrink_uring(ep);
+			if (unlikely(res < 0))
+				return res;
+			if (res)
+				/*
+				 * Events were transferred from klists to
+				 * user ring
+				 */
+				return -ESTALE;
+		}
+	}
+
 	if (timeout > 0) {
 		struct timespec64 end_time = ep_set_mstimeout(timeout);
 
@@ -2695,14 +2716,21 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	__set_current_state(TASK_RUNNING);
 
 send_events:
-	/*
-	 * Try to transfer events to user space. In case we get 0 events and
-	 * there's still timeout left over, we go trying again in search of
-	 * more luck.
-	 */
-	if (!res && eavail &&
-	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
-		goto fetch_events;
+	if (!res && eavail) {
+		if (!ep_polled_by_user(ep)) {
+			/*
+			 * Try to transfer events to user space. In case we get
+			 * 0 events and there's still timeout left over, we go
+			 * trying again in search of more luck.
+			 */
+			res = ep_send_events(ep, events, maxevents);
+			if (!res && !timed_out)
+				goto fetch_events;
+		} else {
+			/* User has to deal with the ring himself */
+			res = -ESTALE;
+		}
+	}
 
 	if (waiter) {
 		spin_lock_irq(&ep->wq.lock);
-- 
2.19.1



* [RFC PATCH 15/15] epoll: support mapping for epfd when polled from userspace
  2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
                   ` (13 preceding siblings ...)
  2019-01-09 16:40 ` [RFC PATCH 14/15] epoll: support polling from userspace for ep_poll() Roman Penyaev
@ 2019-01-09 16:40 ` Roman Penyaev
  14 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-09 16:40 UTC (permalink / raw)
  Cc: Roman Penyaev, Andrew Morton, Davidlohr Bueso, Jason Baron,
	Al Viro, Paul E. McKenney, Linus Torvalds, Andrea Parri,
	linux-fsdevel, linux-kernel

User has to mmap the vmalloc'ed user_header and user_index pointers in order
to consume events from userspace.  Support mapping with the possibility to
mremap() in the future, i.e. the vma does not have the VM_DONTEXPAND flag set.

User mmaps two pointers, header and index, so that both can be expanded by
calling mremap().

Expanding is done with the help of the fault callback, where a page is mapped
after all appropriate size checks.
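
The size and offset checks that the mmap handler applies can be summarised
by the following userspace sketch.  It is an illustration only:
struct areas, check_mapping() and SKETCH_PAGE_SHIFT are hypothetical names,
and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical view of the two areas exported to userspace. */
struct areas {
	size_t header_length; /* bytes in the user header area */
	size_t index_length;  /* bytes in the user index ring */
};

/*
 * pgoff == 0 selects the header, any other pgoff selects the index ring
 * and must point exactly behind the header.  Returns 0 if the mapping
 * request is acceptable, -ENXIO otherwise.
 */
int check_mapping(const struct areas *a, unsigned long pgoff, size_t size)
{
	assert(a != NULL);

	if (pgoff == 0)
		return size > a->header_length ? -ENXIO : 0;

	/* The index ring starts exactly after the header. */
	if (a->header_length != ((size_t)pgoff << SKETCH_PAGE_SHIFT))
		return -ENXIO;
	if (size > a->index_length)
		return -ENXIO;

	return 0;
}
```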

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 5de640fcf28b..2849b238f80b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1388,11 +1388,96 @@ static void ep_show_fdinfo(struct seq_file *m, struct file *f)
 }
 #endif
 
+static vm_fault_t ep_eventpoll_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct eventpoll *ep = vma->vm_file->private_data;
+	size_t off = vmf->address - vma->vm_start;
+	vm_fault_t ret;
+	int rc;
+
+	mutex_lock(&ep->mtx);
+	ret = VM_FAULT_SIGBUS;
+	if (!vma->vm_pgoff) {
+		if (ep->header_length < (off + PAGE_SIZE))
+			goto unlock_and_out;
+
+		rc = remap_vmalloc_range_partial(vma, vmf->address,
+						 ep->user_header + off,
+						 PAGE_SIZE);
+	} else {
+		if (ep->index_length < (off + PAGE_SIZE))
+			goto unlock_and_out;
+
+		rc = remap_vmalloc_range_partial(vma, vmf->address,
+						 ep->user_index + off,
+						 PAGE_SIZE);
+	}
+	if (likely(!rc)) {
+		/* Success path */
+		vma->vm_flags &= ~VM_DONTEXPAND;
+		ret = VM_FAULT_NOPAGE;
+	}
+unlock_and_out:
+	mutex_unlock(&ep->mtx);
+
+	return ret;
+}
+
+static const struct vm_operations_struct eventpoll_vm_ops = {
+	.fault = ep_eventpoll_fault,
+};
+
+static int ep_eventpoll_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct eventpoll *ep = vma->vm_file->private_data;
+	size_t size;
+	int rc;
+
+	if (!ep_polled_by_user(ep))
+		return -ENOTSUPP;
+
+	mutex_lock(&ep->mtx);
+	rc = -ENXIO;
+	size = vma->vm_end - vma->vm_start;
+	if (!vma->vm_pgoff && size > ep->header_length)
+		goto unlock_and_out;
+	if (vma->vm_pgoff && ep->header_length != (vma->vm_pgoff << PAGE_SHIFT))
+		/*
+		 * Index ring starts exactly after header.  Later on vm_pgoff
+		 * is used only as an indication of which kernel ptr is mapped.
+		 */
+		goto unlock_and_out;
+	if (vma->vm_pgoff && size > ep->index_length)
+		goto unlock_and_out;
+
+	/*
+	 * vm_pgoff is used *only* to indicate what is mapped: user header
+	 * or user index ring.
+	 */
+	if (!vma->vm_pgoff)
+		rc = remap_vmalloc_range_partial(vma, vma->vm_start,
+						 ep->user_header, size);
+	else
+		rc = remap_vmalloc_range_partial(vma, vma->vm_start,
+						 ep->user_index, size);
+
+	if (likely(!rc)) {
+		vma->vm_flags &= ~VM_DONTEXPAND;
+		vma->vm_ops = &eventpoll_vm_ops;
+	}
+unlock_and_out:
+	mutex_unlock(&ep->mtx);
+
+	return rc;
+}
+
 /* File callbacks that implement the eventpoll file behaviour */
 static const struct file_operations eventpoll_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= ep_show_fdinfo,
 #endif
+	.mmap		= ep_eventpoll_mmap,
 	.release	= ep_eventpoll_release,
 	.poll		= ep_eventpoll_poll,
 	.llseek		= noop_llseek,
-- 
2.19.1



* Re: [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog
  2019-01-09 16:40 ` [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog Roman Penyaev
@ 2019-01-09 16:50   ` Matthew Wilcox
  2019-01-10 10:08     ` Roman Penyaev
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2019-01-09 16:50 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Andrew Morton, Michal Hocko, Andrey Ryabinin, Joe Perches,
	Luis R. Rodriguez, linux-mm, linux-kernel

On Wed, Jan 09, 2019 at 05:40:13PM +0100, Roman Penyaev wrote:
> Basically vrealloc() repeats glibc realloc() with only one big difference:
> old area is not freed, i.e. caller is responsible for calling vfree() in
> case of successful reallocation.

Ouch.  Don't call it the same thing when you're providing such different
semantics.  I agree with you that the new semantics are useful ones,
I just want it called something else.  Maybe vcopy()?  vclone()?
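
To make the difference in semantics concrete, here is a userspace sketch of
the "clone" behaviour being discussed.  buf_clone() is a hypothetical name
and plain calloc() stands in for the vmalloc family; unlike realloc(), the
old buffer stays valid and the caller frees it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a new zeroed buffer of new_size bytes and copy over as much of
 * the old contents as fits.  The old buffer is NOT freed; on success the
 * caller owns both buffers and has to free the old one itself.
 */
void *buf_clone(const void *old, size_t old_size, size_t new_size)
{
	size_t keep = old_size < new_size ? old_size : new_size;
	void *fresh;

	assert(old != NULL || old_size == 0);

	fresh = calloc(1, new_size ? new_size : 1);
	if (!fresh)
		return NULL;
	if (keep)
		memcpy(fresh, old, keep);

	return fresh;
}
```

A caller that wants realloc()-like behaviour has to free the old pointer
after a successful call, which is exactly the step that is easy to forget
when the function carries a realloc-style name.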

> + *	Do not forget to call vfree() passing old address.  But careful,
> + *	calling vfree() from interrupt will cause vfree_deferred() call,
> + *	which in its turn uses freed address as a temporal pointer for a

"temporary", not temporal.

> + *	llist element, i.e. memory will be corrupted.


* Re: [RFC PATCH 09/15] epoll: introduce stand-alone helpers for polling from userspace
  2019-01-09 16:40 ` [RFC PATCH 09/15] epoll: introduce stand-alone helpers for polling " Roman Penyaev
@ 2019-01-09 17:29   ` Linus Torvalds
  2019-01-10 10:03     ` Roman Penyaev
  0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2019-01-09 17:29 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Andrew Morton, Davidlohr Bueso, Jason Baron, Al Viro,
	Paul E. McKenney, Andrea Parri, linux-fsdevel,
	Linux List Kernel Mailing

On Wed, Jan 9, 2019 at 8:40 AM Roman Penyaev <rpenyaev@suse.de> wrote:
>
> ep_vrealloc*()
>     realloc user header, user index or bitmap memory

What? No.

This is wrong, it's much too complicated. And because your
'vrealloc()' doesn't follow the normal realloc rules, it looks both
confusing and buggy, and people have to remember that "oh, vrealloc()
isn't actually vrealloc(), it's really vdupalloc()".

Your other patch to allow users to apparently also do mremap of these
things seems entirely wrongheaded too. Especially when you then have
magical rules for vm_pgoff, which is one of the things that unmapping
parts of a mmap will touch.

So I say no, no, no. This is all *much* too complicated, and the
interfaces are mis-designed to be overly generous to people doing odd
and pointless things.

If you can't have a fixed-size user buffer that stays in one place,
don't even bother.

             Linus


* Re: [RFC PATCH 09/15] epoll: introduce stand-alone helpers for polling from userspace
  2019-01-09 17:29   ` Linus Torvalds
@ 2019-01-10 10:03     ` Roman Penyaev
  0 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-10 10:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Davidlohr Bueso, Jason Baron, Al Viro,
	Paul E. McKenney, Andrea Parri, linux-fsdevel,
	Linux List Kernel Mailing

On 2019-01-09 18:29, Linus Torvalds wrote:
> On Wed, Jan 9, 2019 at 8:40 AM Roman Penyaev <rpenyaev@suse.de> wrote:
>> 
>> ep_vrealloc*()
>>     realloc user header, user index or bitmap memory
> 
> What? No.
> 
> This is wrong, it's much too complicated. And because your
> 'vrealloc()' doesn't follow the normal realloc rules, it looks both
> confusing and buggy, and people have to remember that "oh, vrealloc()
> isn't actually vrealloc(), it's really vdupalloc()".
> 
> Your other patch to allow users to apparently also do mremap of these
> things seems entirely wrongheaded too. Especially when you then have
> magical rules for vm_pgoff, which is one of the things that unmapping
> parts of a mmap will touch.
> 
> So I say no, no, no. This is all *much* too complicated, and the
> interfaces are mis-designed to be overly generous to people doing odd
> and pointless things.
> 
> If you can't have a fixed-size user buffer that stays in one place,
> don't even bother.

I agree that set of "rules" for this interface is indeed complicated.
The goal was to solve the problem with a constantly changing set of
items (which can be increased / decreased from another thread) without
adding new ctl calls or any limitations.

To fix the size of a user buffer is seems easy to do.  One way is still
to support expand with, say, epoll_ctl(EPOLL_CTL_EXPAND) call and user
has to react explicitly on ENOSPC from epoll_ctl(EPOLL_CTL_ADD).  Thus
reallocation happens, but by user request.

Another way seems much simpler but has a limitation: the user has to
specify the expected maximum by passing the value to a new epoll_create
syscall, e.g. epoll_create2(EPOLL_USERPOLL, 1000).  A further attempt to
add descriptor number 1001 will end with ENOSPC.  Period.  No magic under
the hood.  The 1001st descriptor can be added to a new epoll, which can
then be nested (this is forbidden for "polled from user" descriptors in
the current implementation, but should not be difficult to allow).  Then
yes, no remapping / reallocating.  But this epoll nesting thing ... which
personally I do not like.
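
The fixed-limit variant can be sketched in the same spirit.  This is a
userspace model only: epoll_create2() does not exist, and fixed_create(),
fixed_add() and add_with_overflow() are hypothetical names used to show
where ENOSPC appears and how a second set would absorb the overflow.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model: capacity is fixed at creation and never changes. */
struct fixed_epoll {
	size_t nr_items;
	size_t max_items;
};

/* Stands in for the proposed epoll_create2(EPOLL_USERPOLL, max_items). */
static struct fixed_epoll fixed_create(size_t max_items)
{
	struct fixed_epoll ep = { .nr_items = 0, .max_items = max_items };

	return ep;
}

/* Stands in for epoll_ctl(EPOLL_CTL_ADD): item max_items + 1 gets -ENOSPC. */
static int fixed_add(struct fixed_epoll *ep)
{
	assert(ep != NULL);
	if (ep->nr_items >= ep->max_items)
		return -ENOSPC;
	ep->nr_items++;
	return 0;
}

/* Overflow policy: try the first set, fall back to a second one when full. */
static int add_with_overflow(struct fixed_epoll *first,
			     struct fixed_epoll *second)
{
	int err = fixed_add(first);

	return err == -ENOSPC ? fixed_add(second) : err;
}
```

The model leaves out the nesting itself (adding the second epfd to the
first one); it only shows that the buffer size never changes after
creation.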

What do you think?

--
Roman


* Re: [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog
  2019-01-09 16:50   ` Matthew Wilcox
@ 2019-01-10 10:08     ` Roman Penyaev
  0 siblings, 0 replies; 20+ messages in thread
From: Roman Penyaev @ 2019-01-10 10:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Michal Hocko, Andrey Ryabinin, Joe Perches,
	Luis R. Rodriguez, linux-mm, linux-kernel

On 2019-01-09 17:50, Matthew Wilcox wrote:
> On Wed, Jan 09, 2019 at 05:40:13PM +0100, Roman Penyaev wrote:
>> Basically vrealloc() repeats glibc realloc() with only one big difference:
>> old area is not freed, i.e. caller is responsible for calling vfree() in
>> case of successful reallocation.
> 
> Ouch.  Don't call it the same thing when you're providing such different
> semantics.  I agree with you that the new semantics are useful ones,
> I just want it called something else.  Maybe vcopy()?  vclone()?

vclone(). I like vclone().  But Linus does not like this reallocation
under the hood for epoll (where this vrealloc() would have been used),
so it seems it won't be needed at all.
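
For reference, the semantic difference can be shown with a userspace
analogue.  This is a sketch built on malloc-family calls, not the
vmalloc-based patch; clone_area() is a hypothetical name, and zeroing
the tail of the new area is an assumption of this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a new area, copy the old contents into it, and leave the old
 * area untouched.  Unlike realloc(), the caller still owns 'old' after a
 * successful call and has to free it explicitly.
 */
static void *clone_area(const void *old, size_t old_size, size_t new_size)
{
	size_t copy = old_size < new_size ? old_size : new_size;
	void *new_area;

	assert(old != NULL || old_size == 0);

	/* calloc() zeroes the part of the new area beyond the copied bytes. */
	new_area = calloc(1, new_size ? new_size : 1);
	if (!new_area)
		return NULL; /* on failure 'old' stays valid, as with realloc() */
	if (copy)
		memcpy(new_area, old, copy);
	return new_area;
}
```

After a successful call both pointers are live, which is exactly why
reusing the realloc name for this behaviour is misleading.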

> 
>> + *	Do not forget to call vfree() passing old address.  But careful,
>> + *	calling vfree() from interrupt will cause vfree_deferred() call,
>> + *	which in its turn uses freed address as a temporal pointer for a
> 
> "temporary", not temporal.

Ha! Now I got the difference.  Thanks, Matthew :)

> 
>> + *	llist element, i.e. memory will be corrupted.

--
Roman



Thread overview: 20+ messages
2019-01-09 16:40 [RFC 00/15] epoll: support pollable epoll from userspace Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 01/15] mm/vmalloc: add new 'alignment' field for vm_struct structure Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 02/15] mm/vmalloc: move common logic from __vmalloc_area_node to a separate func Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 03/15] mm/vmalloc: introduce new vrealloc() call and its subsidiary reach analog Roman Penyaev
2019-01-09 16:50   ` Matthew Wilcox
2019-01-10 10:08     ` Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 04/15] epoll: move private helpers from a header to the source Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 05/15] epoll: introduce user header structure and user index for polling from userspace Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 06/15] epoll: introduce various of helpers for user structure lengths calculations Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 07/15] epoll: extend epitem struct with new members for polling from userspace Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 08/15] epoll: some sanity flags checks for epoll syscalls for polled epfd " Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 09/15] epoll: introduce stand-alone helpers for polling " Roman Penyaev
2019-01-09 17:29   ` Linus Torvalds
2019-01-10 10:03     ` Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 10/15] epoll: support polling from userspace for ep_insert() Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 11/15] epoll: offload polling to a work in case of epfd polled from userspace Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 12/15] epoll: support polling from userspace for ep_remove() Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 13/15] epoll: support polling from userspace for ep_modify() Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 14/15] epoll: support polling from userspace for ep_poll() Roman Penyaev
2019-01-09 16:40 ` [RFC PATCH 15/15] epoll: support mapping for epfd when polled from userspace Roman Penyaev
