LKML Archive on lore.kernel.org
 help / color / Atom feed
From: Roman Penyaev <rpenyaev@suse.de>
To: Jens Axboe <axboe@kernel.dk>
Cc: Azat Khuzhin <azat@libevent.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace
Date: Wed, 05 Jun 2019 08:17:02 +0200
Message-ID: <abf62eeb1b0be93404967c1e98190e83@suse.de> (raw)
In-Reply-To: <3d9903f6-ebf8-6b8b-6251-b3a305dc9f19@kernel.dk>

On 2019-05-31 23:09, Jens Axboe wrote:
> On 5/31/19 1:45 PM, Roman Penyaev wrote:
>> On 2019-05-31 18:54, Jens Axboe wrote:
>>> On 5/31/19 10:02 AM, Roman Penyaev wrote:
>>>> On 2019-05-31 16:48, Jens Axboe wrote:
>>>>> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>>>>>> Hi all,
>>>>>> 
>>>>>> This is v3 which introduces pollable epoll from userspace.
>>>>>> 
>>>>>> v3:
>>>>>>     - Measurements made, represented below.
>>>>>> 
>>>>>>     - Fix alignment for epoll_uitem structure on all 64-bit archs
>>>>>> except
>>>>>>       x86-64. epoll_uitem should be always 16 bit, proper
>>>>>> BUILD_BUG_ON
>>>>>>       is added. (Linus)
>>>>>> 
>>>>>>     - Check pollflags explicitly on 0 inside work callback, and do
>>>>>> nothing
>>>>>>       if 0.
>>>>>> 
>>>>>> v2:
>>>>>>     - No reallocations, the max number of items (thus size of the
>>>>>> user
>>>>>> ring)
>>>>>>       is specified by the caller.
>>>>>> 
>>>>>>     - Interface is simplified: -ENOSPC is returned on attempt to 
>>>>>> add
>>>>>> a
>>>>>> new
>>>>>>       epoll item if number is reached the max, nothing more.
>>>>>> 
>>>>>>     - Alloced pages are accounted using user->locked_vm and 
>>>>>> limited
>>>>>> to
>>>>>>       RLIMIT_MEMLOCK value.
>>>>>> 
>>>>>>     - EPOLLONESHOT is handled.
>>>>>> 
>>>>>> This series introduces pollable epoll from userspace, i.e. user
>>>>>> creates
>>>>>> epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets
>>>>>> header
>>>>>> and ring pointers and then consumes ready events from a ring,
>>>>>> avoiding
>>>>>> epoll_wait() call.  When ring is empty, user has to call
>>>>>> epoll_wait()
>>>>>> in order to wait for new events.  epoll_wait() returns -ESTALE if
>>>>>> user
>>>>>> ring has events in the ring (kind of indication, that user has to
>>>>>> consume
>>>>>> events from the user ring first, I could not invent anything 
>>>>>> better
>>>>>> than
>>>>>> returning -ESTALE).
>>>>>> 
>>>>>> For user header and user ring allocation I used vmalloc_user().  I
>>>>>> found
>>>>>> that it is much easy to reuse remap_vmalloc_range_partial() 
>>>>>> instead
>>>>>> of
>>>>>> dealing with page cache (like aio.c does).  What is also nice is
>>>>>> that
>>>>>> virtual address is properly aligned on SHMLBA, thus there should 
>>>>>> not
>>>>>> be
>>>>>> any d-cache aliasing problems on archs with vivt or vipt caches.
>>>>> 
>>>>> Why aren't we just adding support to io_uring for this instead? 
>>>>> Then
>>>>> we
>>>>> don't need yet another entirely new ring, that's is just a little
>>>>> different from what we have.
>>>>> 
>>>>> I haven't looked into the details of your implementation, just
>>>>> curious
>>>>> if there's anything that makes using io_uring a non-starter for 
>>>>> this
>>>>> purpose?
>>>> 
>>>> Afaict the main difference is that you do not need to recharge an fd
>>>> (submit new poll request in terms of io_uring): once fd has been 
>>>> added
>>>> to
>>>> epoll with epoll_ctl() - we get events.  When you have thousands of
>>>> fds
>>>> -
>>>> that should matter.
>>>> 
>>>> Also interesting question is how difficult to modify existing event
>>>> loops
>>>> in event libraries in order to support recharging (EPOLLONESHOT in
>>>> terms
>>>> of epoll).
>>>> 
>>>> Maybe Azat who maintains libevent can shed light on this (currently 
>>>> I
>>>> see
>>>> that libevent does not support "EPOLLONESHOT" logic).
>>> 
>>> In terms of existing io_uring poll support, which is what I'm 
>>> guessing
>>> you're referring to, it is indeed just one-shot.
>> 
>> Yes, yes.
>> 
>>> But there's no reason  why we can't have it persist until explicitly
>>> canceled with POLL_REMOVE.
>> 
>> It seems not so easy.  The main problem is that with only a ring it is
>> impossible to figure out on kernel side what event bits have been
>> already
>> seen by the userspace and what bits are new.  So every new cqe has to
>> be added to a completion ring on each wake_up_interruptible() call.
>> (I mean when fd wants to report that something is ready).
>> 
>> IMO that can lead to many duplicate events (tens? hundreds? honestly 
>> no
>> idea), which userspace has to handle with subsequent read/write calls.
>> It can kill all performance benefits of a uring.
>> 
>> In uepoll this is solved with another piece of shared memory, where
>> userspace atomically clears bits and kernel side sets bits.  If kernel
>> observes that bits were set (i.e. userspace has not seen this event)
>> - new index is added to a ring.
> 
> Those are good points.
> 
>> Can we extend the io_uring API to support this behavior?  Also would
>> be great if we can make event path lockless.  On a big number of fds
>> and frequent events - this matters, please take a look, recently I
>> did some measurements:  https://lkml.org/lkml/2018/12/12/305
> 
> Yeah, I'd be happy to entertain that idea, and lockless completions as
> well. We already do that for polled IO, but consider any "normal"
> completion to be IRQ driven and hence need locking.

I would like to contribute as much as I can. "Subscription" on events
along with lockless ring seems reasonable to do for io_uring. I still
tend to think that uepoll and io_uring poll can coexist, at least
because it can be difficult to adopt current event libraries to async
nature of "add fd" / "remove add" requests of the io_uring, e.g. when
epoll_ctl() is called in order to remove fd, the caller expects no
events come after epoll_ctl() returns. Async behavior can break the
event loop. What can help is ability to wait on particular request,
which seems not possible without ugly tricks, right? (Under ugly tricks
I mean something as: wait for any event, traverse the completion ring
in order to meet particular completion, repeat if nothing is found).

Also epoll_ctl() can be called from another thread in order to
add/remove fd, and I suppose that is also successfully used by event
loop libraries or users of these libraries (not quite sure though, but
can imagine why it can be useful). To fix that will require introducing
locks on submission path of io_uring callers (I mean on user side,
inside these libraries), which can impact performance for generic
cases (only submission though).

What I want to say is that polling using io_uring can be used in some
new io/event stacks, but adoption of current event libraries can be
non trivial, where old plain epoll with a ring can be an easiest way.
But of course that's only my speculation.

--
Roman


  reply index

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-16  8:57 Roman Penyaev
2019-05-16  8:57 ` [PATCH v3 01/13] epoll: move private helpers from a header to the source Roman Penyaev
2019-05-16  8:57 ` [PATCH v3 02/13] epoll: introduce user structures for polling from userspace Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 03/13] epoll: allocate user header and user events ring " Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 04/13] epoll: some sanity flags checks for epoll syscalls " Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 05/13] epoll: offload polling to a work in case of epfd polled " Roman Penyaev
2019-05-21  7:51   ` Eric Wong
2019-05-22 12:54     ` Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 06/13] epoll: introduce helpers for adding/removing events to uring Roman Penyaev
2019-05-31  9:55   ` Peter Zijlstra
2019-05-31 11:24     ` Roman Penyaev
2019-05-31 13:11       ` Peter Zijlstra
2019-05-31  9:56   ` Peter Zijlstra
2019-05-31 11:15     ` Roman Penyaev
2019-05-31 12:53       ` Peter Zijlstra
2019-05-31 14:28         ` Roman Penyaev
2019-05-31 16:53           ` Peter Zijlstra
2019-05-31 12:56       ` Peter Zijlstra
2019-05-31 14:21         ` Roman Penyaev
2019-05-31 16:51           ` Peter Zijlstra
2019-05-31 18:58             ` Roman Penyaev
2019-06-03  9:09               ` Peter Zijlstra
2019-06-03 10:02                 ` Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback() Roman Penyaev
2019-05-31  9:56   ` Peter Zijlstra
2019-05-31 11:22     ` Roman Penyaev
2019-05-31 13:05       ` Peter Zijlstra
2019-05-31 15:05         ` Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 08/13] epoll: support polling from userspace for ep_insert() Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 09/13] epoll: support polling from userspace for ep_remove() Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 10/13] epoll: support polling from userspace for ep_modify() Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 11/13] epoll: support polling from userspace for ep_poll() Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 12/13] epoll: support mapping for epfd when polled from userspace Roman Penyaev
2019-05-16  8:58 ` [PATCH v3 13/13] epoll: implement epoll_create2() syscall Roman Penyaev
2019-05-16 10:03   ` Arnd Bergmann
2019-05-16 10:20     ` Roman Penyaev
2019-05-16 10:57       ` Arnd Bergmann
2019-05-22  2:33       ` Andrew Morton
2019-05-22  9:11         ` Roman Penyaev
2019-05-22 11:14         ` Arnd Bergmann
2019-05-22 18:36           ` Andrew Morton
2019-05-31  9:55 ` [PATCH v3 00/13] epoll: support pollable epoll from userspace Peter Zijlstra
2019-05-31 14:48 ` Jens Axboe
2019-05-31 16:02   ` Roman Penyaev
2019-05-31 16:54     ` Jens Axboe
2019-05-31 19:45       ` Roman Penyaev
2019-05-31 21:09         ` Jens Axboe
2019-06-05  6:17           ` Roman Penyaev [this message]
2019-05-31 16:33 ` Peter Zijlstra
2019-05-31 18:50   ` Roman Penyaev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=abf62eeb1b0be93404967c1e98190e83@suse.de \
    --to=rpenyaev@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=azat@libevent.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git