linux-kernel.vger.kernel.org archive mirror
* Re: epoll and multiple processes - eliminate unneeded process wake-ups
@ 2015-11-28 22:54 Madars Vitolins
  2015-11-30 19:45 ` Jason Baron
  0 siblings, 1 reply; 11+ messages in thread
From: Madars Vitolins @ 2015-11-28 22:54 UTC (permalink / raw)
  To: Jason Baron; +Cc: Eric Wong, linux-kernel

Hi Jason,

I recently ran tests with multiple processes and epoll() on POSIX
queues. You were right about "EP_MAX_NESTS": it is not related to how
many processes are woken up when several processes are waiting in
epoll_wait() on one event source.

With epoll, every process is added to the wait queue of every monitored
event source. Thus when a message is sent to some queue (for example),
all processes polling on it are woken up during the kernel's
mq_timedsend() -> __do_notify() -> wake_up(&info->wait_q) processing.

So for a message to be handled by only one of the processes in
epoll_wait(), the entry in the event source's wait queue has to be
added with the exclusive flag set.

I could create a kernel patch adding a new EPOLLEXCL flag, which would
result in the following functionality:

- fs/eventpoll.c
================================================================================
/*
  * This is the callback that is used to add our wait queue to the
  * target file wakeup lists.
  */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
        struct epitem *epi = ep_item_from_epqueue(pt);
        struct eppoll_entry *pwq;

        if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
                init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
                pwq->whead = whead;
                pwq->base = epi;

                if (epi->event.events & EPOLLEXCL) { /* <<<< new functionality here */
                        add_wait_queue_exclusive(whead, &pwq->wait);
                } else {
                        add_wait_queue(whead, &pwq->wait);
                }
                list_add_tail(&pwq->llink, &epi->pwqlist);
                epi->nwait++;
        } else {
                /* We have to signal that an error occurred */
                epi->nwait = -1;
        }
}
================================================================================
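
For reference, a minimal userspace sketch of how a worker process could
use the proposed flag (assumptions: EPOLLEXCL is not in the glibc
headers yet, so it is defined locally with the bit value discussed
below; the queue name and buffer size are only illustrative):

#include <sys/epoll.h>
#include <mqueue.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef EPOLLEXCL
#define EPOLLEXCL (1 << 28)     /* proposed bit, not an official UAPI value */
#endif

int main(void)
{
        mqd_t q = mq_open("/myqueue", O_RDONLY | O_NONBLOCK);
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCL };
        char buf[8192];         /* must be >= the queue's mq_msgsize */

        ev.data.fd = q;
        if (q == (mqd_t)-1 || epfd < 0 ||
            epoll_ctl(epfd, EPOLL_CTL_ADD, q, &ev) < 0) {
                perror("setup");
                exit(1);
        }

        for (;;) {
                struct epoll_event out;

                if (epoll_wait(epfd, &out, 1, -1) <= 0)
                        continue;
                /* With the exclusive flag only one waiter should reach
                 * this point per message; EAGAIN is still handled in
                 * case of a spurious wakeup. */
                if (mq_receive(q, buf, sizeof(buf), NULL) < 0 &&
                    errno != EAGAIN)
                        perror("mq_receive");
        }
}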

After testing with EPOLLEXCL set in my multiprocessing application
framework (now open source: http://www.endurox.org/ :) ), the results
were good: there were no extra wakeups, so processing is more efficient.

Jason, what do you think, would mainline accept such a patch with a new
flag? Or are there any concerns about this? It would also mean that the
new flag has to be added to the GNU C Library
(/usr/include/sys/epoll.h).

Or maybe somebody else who is familiar with the kernel's epoll
functionality can comment on this?

Regarding the flag's bit value, it seems (1<<28) should be used for
EPOLLEXCL, since epoll_event.events is a 32-bit field and the top bit
1<<31 is already used by EPOLLET (in include/uapi/linux/eventpoll.h).

Thanks a lot in advance,
Madars


Jason Baron @ 2015-08-05 15:32 rakstīja:
> On 08/05/2015 07:06 AM, Madars Vitolins wrote:
>> Jason Baron @ 2015-08-04 18:02 rakstīja:
>>> On 08/03/2015 07:48 PM, Eric Wong wrote:
>>>> Madars Vitolins <m@silodev.com> wrote:
>>>>> Hi Folks,
>>>>> 
>>>>> I am developing kind of open systems application, which uses
>>>>> multiple processes/executables where each of them monitors some set
>>>>> of resources (in this case POSIX Queues) via epoll interface. For
>>>>> example when 10 processes on same queue are in state of 
>>>>> epoll_wait()
>>>>> and one message arrives, all 10 processes gets woken up and all of
>>>>> them tries to read the message from Q. One succeeds, the others 
>>>>> gets
>>>>> EAGAIN error. The problem is with those others, which generates
>>>>> extra context switches - useless CPU usage. With more processes
>>>>> inefficiency gets higher.
>>>>> 
>>>>> I tried to use EPOLLONESHOT, but no help. Seems this is suitable 
>>>>> for
>>>>> multi-threaded application and not for multi-process application.
>>>> 
>>>> Correct.  Most FDs are not shared across processes.
>>>> 
>>>>> Ideal mechanism for this would be:
>>>>> 1. If multiple epoll sets in kernel matches same event and one or
>>>>> more processes are in state of epoll_wait() - then send event only
>>>>> to one waiter.
>>>>> 2. If none of processes are in wait state, then send the event to
>>>>> all epoll sets (as it is currently). Then the first free process
>>>>> will grab the event.
>>>> 
>>>> Jason Baron was working on this (search LKML archives for
>>>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
>>>> 
>>>> However, I was unconvinced about modifying epoll.
>>>> 
>>>> Perhaps I may be more easily convinced about your mqueue case than 
>>>> his
>>>> case for listen sockets, though[*]
>>>> 
>>> 
>>> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
>>> multiple epoll fds (or epoll sets) attached to the same wakeup 
>>> source,
>>> and have the wakeups 'rotate' among the epoll sets. The wakeup
>>> essentially walks the list of waiters, wakes up the first thread
>>> that is actively in epoll_wait(), stops and moves the woken up
>>> epoll set to the end of the list. So it attempts to balance
>>> the wakeups among the epoll sets, I think in the way that you
>>> were describing.
>>> 
>>> Here is the patchset:
>>> 
>>> https://lkml.org/lkml/2015/2/24/667
>>> 
>>> The test program shows how to use the API. Essentially, you
>>> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
>>> which you then attach to you're shared wakeup source and
>>> then to your epoll sets. Please let me know if its unclear.
>>> 
>>> Thanks,
>>> 
>>> -Jason
>> 
>> In my particular case I need to work with multiple 
>> processes/executables running (not threads) and listening on same 
>> queues (this concept allows to sysadmin easily manage those processes 
>> (start new ones for balancing or stop them with out service 
>> interruption), and if any process dies for some reason (signal, core, 
>> etc..), the whole application does not get killed, but only one 
>> transaction is lost).
>> 
>> Recently I did tests, and found out that kernel's epoll currently 
>> sends notifications to 4 processes (I think it is EP_MAX_NESTS 
>> constant) waiting on same resource (those other 6 from my example will 
>> stay in sleep state). So it is not as bad as I thought before. It 
>> could be nice if EP_MAX_NESTS could be configurable, but I guess 4 is 
>> fine too.
>> 
> 
> hmmm...EP_MAX_NESTS is about the level 'nesting' epoll sets, IE
> if you can do ep1->ep2->ep3->ep4-> <wakeup src fd>. But you
> can't add in 'ep5'. Where the 'epN' above represent epoll file
> descriptors that are attached together via: EPOLL_CTL_ADD.
> 
> The nesting does not affect how wakeups are down. All epoll fds
> that are attached to the even source fd are going to get wakeups.
> 
> 
>> Jason, does your patch work for multi-process application? How hard it 
>> would be to implement this for such scenario?
> 
> I don't think it would be too hard, but it requires:
> 
> 1) adding the patches
> 2) re-compiling, running new kernel
> 3) modifying your app to the new API.
> 
> Thanks,
> 
> -Jason
> 
> 
>> 
>> Madars
>> 
>>> 
>>>> Typical applications have few (probably only one) listen sockets or
>>>> POSIX mqueues; so I would rather use dedicated threads to issue
>>>> blocking syscalls (accept4 or mq_timedreceive).
>>>> 
>>>> Making blocking syscalls allows exclusive wakeups to avoid 
>>>> thundering
>>>> herds.
>>>> 
>>>>> How do you think, would it be real to implement this? How about
>>>>> concurrency?
>>>>> Can you please give me some hints from which points in code to 
>>>>> start
>>>>> to implement these changes?
>>>> 
>>>> For now, I suggest dedicating a thread in each process to do
>>>> mq_timedreceive/mq_receive, assuming you only have a small amount
>>>> of queues in your system.
>>>> 
>>>> 
>>>> [*] mq_timedreceive may copy a largish buffer which benefits from
>>>>     staying on the same CPU as much as possible.
>>>>     Contrary, accept4 only creates a client socket.  With a C10K+
>>>>     socket server (e.g. http/memcached/DB), a typical new client
>>>>     socket spends a fair amount of time idle.  Thus I don't believe
>>>>     memory locality inside the kernel is much concern when there's
>>>>     thousands of accepted client sockets.
>>>> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-11-28 22:54 epoll and multiple processes - eliminate unneeded process wake-ups Madars Vitolins
@ 2015-11-30 19:45 ` Jason Baron
  2015-11-30 21:28   ` Madars Vitolins
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Baron @ 2015-11-30 19:45 UTC (permalink / raw)
  To: Madars Vitolins; +Cc: Eric Wong, linux-kernel

Hi Madars,

On 11/28/2015 05:54 PM, Madars Vitolins wrote:
> Hi Jason,
> 
> I did recently tests with multiprocessing and epoll() on Posix Queues.
> You were right about "EP_MAX_NESTS", it is not related with how many
> processes are waken up when multiple process epoll_waits are waiting on
> one event source.
> 
> At doing epoll every process is added to wait queue for every monitored
> event source. Thus when message is sent to some queue (for example), all
> processes polling on it are activated during mq_timedsend() ->
> __do_notify () -> wake_up(&info->wait_q) kernel processing.
> 
> So to get one message to be processed only by one process of
> epoll_wait(), it requires that process in event source's  wait queue is
> added with exclusive flag set.
> 
> I could create a kernel patch, by adding new EPOLLEXCL flag which could
> result in following functionality:
> 
> - fs/eventpoll.c
> ================================================================================
> 
> /*
>  * This is the callback that is used to add our wait queue to the
>  * target file wakeup lists.
>  */
> static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t
> *whead,
>                                  poll_table *pt)
> {
>         struct epitem *epi = ep_item_from_epqueue(pt);
>         struct eppoll_entry *pwq;
> 
>         if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache,
> GFP_KERNEL))) {
>                 init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>                 pwq->whead = whead;
>                 pwq->base = epi;
> 
>                 if (epi->event.events & EPOLLEXCL) { <<<< New
> functionality here!!!
>                         add_wait_queue_exclusive(whead, &pwq->wait);
>                 } else {
>                         add_wait_queue(whead, &pwq->wait);
>                 }
>                 list_add_tail(&pwq->llink, &epi->pwqlist);
>                 epi->nwait++;
>         } else {
>                 /* We have to signal that an error occurred */
>                 epi->nwait = -1;
>         }
> }
> ================================================================================
> 
> 
> After doing test with EPOLLEXCL set in my multiprocessing application
> framework (now it is open source: http://www.endurox.org/ :) ), results
> were good, there were no extra wakeups. Thus more efficient processing.
> 

Cool. If you have any performance numbers to share, that would help
support the case.

> Jason, how do you think would mainline accept such patch with new flag?
> Or are there any concerns about this? Also this will mean that new flag
> will be need to add to GNU C Library (/usr/include/sys/epoll.h).
>

This has come up several times, so imo it would be a reasonable
addition - but I'm only speaking for myself.

In terms of implementation, it might make sense to return 0 from
ep_poll_callback() when ep->wq is empty. That way we continue to
search for an active waiter and service wakeups in a more timely
manner if some threads are busy. We probably also don't want to
allow the flag for nested ep descriptors.
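
In code terms that idea is roughly the following (a sketch only, and
essentially what the updated patch later in this thread does):

        /* at the end of ep_poll_callback() */
        int ewake = 0;

        if (waitqueue_active(&ep->wq)) {
                ewake = 1;              /* a thread really is in epoll_wait() */
                wake_up_locked(&ep->wq);
        }
        /* ... */
        if (epi->event.events & EPOLLEXCLUSIVE)
                return ewake;   /* 0 lets __wake_up_common() try the next waiter */
        return 1;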

Thanks,

-Jason

> Or maybe somebody else who is familiar with kernel epoll functionality
> can comment this?
> 
> Regarding the flag's bitmask, seems like (1<<28) needs to be taken for
> EPOLLEXCL as flags type for epoll_event.events is int32 and last bit
> 1<<31 is used by EPOLLET (in include/uapi/linux/eventpoll.h).
> 
> Thanks a lot in advance,
> Madars
> 
> 
> Jason Baron @ 2015-08-05 15:32 rakstīja:
>> On 08/05/2015 07:06 AM, Madars Vitolins wrote:
>>> Jason Baron @ 2015-08-04 18:02 rakstīja:
>>>> On 08/03/2015 07:48 PM, Eric Wong wrote:
>>>>> Madars Vitolins <m@silodev.com> wrote:
>>>>>> Hi Folks,
>>>>>>
>>>>>> I am developing kind of open systems application, which uses
>>>>>> multiple processes/executables where each of them monitors some set
>>>>>> of resources (in this case POSIX Queues) via epoll interface. For
>>>>>> example when 10 processes on same queue are in state of epoll_wait()
>>>>>> and one message arrives, all 10 processes gets woken up and all of
>>>>>> them tries to read the message from Q. One succeeds, the others gets
>>>>>> EAGAIN error. The problem is with those others, which generates
>>>>>> extra context switches - useless CPU usage. With more processes
>>>>>> inefficiency gets higher.
>>>>>>
>>>>>> I tried to use EPOLLONESHOT, but no help. Seems this is suitable for
>>>>>> multi-threaded application and not for multi-process application.
>>>>>
>>>>> Correct.  Most FDs are not shared across processes.
>>>>>
>>>>>> Ideal mechanism for this would be:
>>>>>> 1. If multiple epoll sets in kernel matches same event and one or
>>>>>> more processes are in state of epoll_wait() - then send event only
>>>>>> to one waiter.
>>>>>> 2. If none of processes are in wait state, then send the event to
>>>>>> all epoll sets (as it is currently). Then the first free process
>>>>>> will grab the event.
>>>>>
>>>>> Jason Baron was working on this (search LKML archives for
>>>>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
>>>>>
>>>>> However, I was unconvinced about modifying epoll.
>>>>>
>>>>> Perhaps I may be more easily convinced about your mqueue case than his
>>>>> case for listen sockets, though[*]
>>>>>
>>>>
>>>> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
>>>> multiple epoll fds (or epoll sets) attached to the same wakeup source,
>>>> and have the wakeups 'rotate' among the epoll sets. The wakeup
>>>> essentially walks the list of waiters, wakes up the first thread
>>>> that is actively in epoll_wait(), stops and moves the woken up
>>>> epoll set to the end of the list. So it attempts to balance
>>>> the wakeups among the epoll sets, I think in the way that you
>>>> were describing.
>>>>
>>>> Here is the patchset:
>>>>
>>>> https://lkml.org/lkml/2015/2/24/667
>>>>
>>>> The test program shows how to use the API. Essentially, you
>>>> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
>>>> which you then attach to you're shared wakeup source and
>>>> then to your epoll sets. Please let me know if its unclear.
>>>>
>>>> Thanks,
>>>>
>>>> -Jason
>>>
>>> In my particular case I need to work with multiple
>>> processes/executables running (not threads) and listening on same
>>> queues (this concept allows to sysadmin easily manage those processes
>>> (start new ones for balancing or stop them with out service
>>> interruption), and if any process dies for some reason (signal, core,
>>> etc..), the whole application does not get killed, but only one
>>> transaction is lost).
>>>
>>> Recently I did tests, and found out that kernel's epoll currently
>>> sends notifications to 4 processes (I think it is EP_MAX_NESTS
>>> constant) waiting on same resource (those other 6 from my example
>>> will stay in sleep state). So it is not as bad as I thought before.
>>> It could be nice if EP_MAX_NESTS could be configurable, but I guess 4
>>> is fine too.
>>>
>>
>> hmmm...EP_MAX_NESTS is about the level 'nesting' epoll sets, IE
>> if you can do ep1->ep2->ep3->ep4-> <wakeup src fd>. But you
>> can't add in 'ep5'. Where the 'epN' above represent epoll file
>> descriptors that are attached together via: EPOLL_CTL_ADD.
>>
>> The nesting does not affect how wakeups are down. All epoll fds
>> that are attached to the even source fd are going to get wakeups.
>>
>>
>>> Jason, does your patch work for multi-process application? How hard
>>> it would be to implement this for such scenario?
>>
>> I don't think it would be too hard, but it requires:
>>
>> 1) adding the patches
>> 2) re-compiling, running new kernel
>> 3) modifying your app to the new API.
>>
>> Thanks,
>>
>> -Jason
>>
>>
>>>
>>> Madars
>>>
>>>>
>>>>> Typical applications have few (probably only one) listen sockets or
>>>>> POSIX mqueues; so I would rather use dedicated threads to issue
>>>>> blocking syscalls (accept4 or mq_timedreceive).
>>>>>
>>>>> Making blocking syscalls allows exclusive wakeups to avoid thundering
>>>>> herds.
>>>>>
>>>>>> How do you think, would it be real to implement this? How about
>>>>>> concurrency?
>>>>>> Can you please give me some hints from which points in code to start
>>>>>> to implement these changes?
>>>>>
>>>>> For now, I suggest dedicating a thread in each process to do
>>>>> mq_timedreceive/mq_receive, assuming you only have a small amount
>>>>> of queues in your system.
>>>>>
>>>>>
>>>>> [*] mq_timedreceive may copy a largish buffer which benefits from
>>>>>     staying on the same CPU as much as possible.
>>>>>     Contrary, accept4 only creates a client socket.  With a C10K+
>>>>>     socket server (e.g. http/memcached/DB), a typical new client
>>>>>     socket spends a fair amount of time idle.  Thus I don't believe
>>>>>     memory locality inside the kernel is much concern when there's
>>>>>     thousands of accepted client sockets.
>>>>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-11-30 19:45 ` Jason Baron
@ 2015-11-30 21:28   ` Madars Vitolins
  2015-12-01 20:11     ` Jason Baron
  0 siblings, 1 reply; 11+ messages in thread
From: Madars Vitolins @ 2015-11-30 21:28 UTC (permalink / raw)
  To: Jason Baron; +Cc: Eric Wong, linux-kernel

Hi Jason,

Today I searched the mail archive and found the patch you posted back
in February; it basically does the same thing (a flag for
add_wait_queue_exclusive() + balancing).

So I plan to run some tests with your patch, with the flag on and off,
and will provide the results. I guess if I bring up 250 or 500
processes (which is realistic for a production environment) waiting on
one queue, there could be a notable difference in performance with
EPOLLEXCLUSIVE set versus not.

During kernel hacking with debug prints, with 10 processes waiting on
one event source, on the original kernel I saw a lot of unneeded
processing inside eventpoll.c: a single event produced 10 calls to
ep_poll_callback() and related work, which resulted in only a few
processes being woken up in user space (the count probably varies
randomly depending on concurrency).


Meanwhile we are not the only ones talking about this patch, see here:
http://stackoverflow.com/questions/33226842/epollexclusive-and-epollroundrobin-flags-in-mainstream-kernel
- others are asking too.

So what is the current situation with your patch, and what is blocking
it from getting into mainline?

Thanks,
Madars



Jason Baron @ 2015-11-30 21:45 rakstīja:
> Hi Madars,
> 
> On 11/28/2015 05:54 PM, Madars Vitolins wrote:
>> Hi Jason,
>> 
>> I did recently tests with multiprocessing and epoll() on Posix Queues.
>> You were right about "EP_MAX_NESTS", it is not related with how many
>> processes are waken up when multiple process epoll_waits are waiting 
>> on
>> one event source.
>> 
>> At doing epoll every process is added to wait queue for every 
>> monitored
>> event source. Thus when message is sent to some queue (for example), 
>> all
>> processes polling on it are activated during mq_timedsend() ->
>> __do_notify () -> wake_up(&info->wait_q) kernel processing.
>> 
>> So to get one message to be processed only by one process of
>> epoll_wait(), it requires that process in event source's  wait queue 
>> is
>> added with exclusive flag set.
>> 
>> I could create a kernel patch, by adding new EPOLLEXCL flag which 
>> could
>> result in following functionality:
>> 
>> - fs/eventpoll.c
>> ================================================================================
>> 
>> /*
>>  * This is the callback that is used to add our wait queue to the
>>  * target file wakeup lists.
>>  */
>> static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t
>> *whead,
>>                                  poll_table *pt)
>> {
>>         struct epitem *epi = ep_item_from_epqueue(pt);
>>         struct eppoll_entry *pwq;
>> 
>>         if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache,
>> GFP_KERNEL))) {
>>                 init_waitqueue_func_entry(&pwq->wait, 
>> ep_poll_callback);
>>                 pwq->whead = whead;
>>                 pwq->base = epi;
>> 
>>                 if (epi->event.events & EPOLLEXCL) { <<<< New
>> functionality here!!!
>>                         add_wait_queue_exclusive(whead, &pwq->wait);
>>                 } else {
>>                         add_wait_queue(whead, &pwq->wait);
>>                 }
>>                 list_add_tail(&pwq->llink, &epi->pwqlist);
>>                 epi->nwait++;
>>         } else {
>>                 /* We have to signal that an error occurred */
>>                 epi->nwait = -1;
>>         }
>> }
>> ================================================================================
>> 
>> 
>> After doing test with EPOLLEXCL set in my multiprocessing application
>> framework (now it is open source: http://www.endurox.org/ :) ), 
>> results
>> were good, there were no extra wakeups. Thus more efficient 
>> processing.
>> 
> 
> Cool. If you have any performance numbers to share that would be more
> supportive.
> 
>> Jason, how do you think would mainline accept such patch with new 
>> flag?
>> Or are there any concerns about this? Also this will mean that new 
>> flag
>> will be need to add to GNU C Library (/usr/include/sys/epoll.h).
>> 
> 
> This has come up several times - so imo I think it would be a 
> reasonable
> addition - but I'm only speaking for myself.
> 
> In terms of implementation it might make sense to return 0 from
> ep_poll_callback() in case ep->wq is empty. That way we continue to
> search for an active waiter. That way we service wakeups in a more
> timely manner if some threads are busy. We probably also don't want to
> allow the flag for nested ep descriptors.
> 
> Thanks,
> 
> -Jason
> 
>> Or maybe somebody else who is familiar with kernel epoll functionality
>> can comment this?
>> 
>> Regarding the flag's bitmask, seems like (1<<28) needs to be taken for
>> EPOLLEXCL as flags type for epoll_event.events is int32 and last bit
>> 1<<31 is used by EPOLLET (in include/uapi/linux/eventpoll.h).
>> 
>> Thanks a lot in advance,
>> Madars
>> 
>> 
>> Jason Baron @ 2015-08-05 15:32 rakstīja:
>>> On 08/05/2015 07:06 AM, Madars Vitolins wrote:
>>>> Jason Baron @ 2015-08-04 18:02 rakstīja:
>>>>> On 08/03/2015 07:48 PM, Eric Wong wrote:
>>>>>> Madars Vitolins <m@silodev.com> wrote:
>>>>>>> Hi Folks,
>>>>>>> 
>>>>>>> I am developing kind of open systems application, which uses
>>>>>>> multiple processes/executables where each of them monitors some 
>>>>>>> set
>>>>>>> of resources (in this case POSIX Queues) via epoll interface. For
>>>>>>> example when 10 processes on same queue are in state of 
>>>>>>> epoll_wait()
>>>>>>> and one message arrives, all 10 processes gets woken up and all 
>>>>>>> of
>>>>>>> them tries to read the message from Q. One succeeds, the others 
>>>>>>> gets
>>>>>>> EAGAIN error. The problem is with those others, which generates
>>>>>>> extra context switches - useless CPU usage. With more processes
>>>>>>> inefficiency gets higher.
>>>>>>> 
>>>>>>> I tried to use EPOLLONESHOT, but no help. Seems this is suitable 
>>>>>>> for
>>>>>>> multi-threaded application and not for multi-process application.
>>>>>> 
>>>>>> Correct.  Most FDs are not shared across processes.
>>>>>> 
>>>>>>> Ideal mechanism for this would be:
>>>>>>> 1. If multiple epoll sets in kernel matches same event and one or
>>>>>>> more processes are in state of epoll_wait() - then send event 
>>>>>>> only
>>>>>>> to one waiter.
>>>>>>> 2. If none of processes are in wait state, then send the event to
>>>>>>> all epoll sets (as it is currently). Then the first free process
>>>>>>> will grab the event.
>>>>>> 
>>>>>> Jason Baron was working on this (search LKML archives for
>>>>>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
>>>>>> 
>>>>>> However, I was unconvinced about modifying epoll.
>>>>>> 
>>>>>> Perhaps I may be more easily convinced about your mqueue case than 
>>>>>> his
>>>>>> case for listen sockets, though[*]
>>>>>> 
>>>>> 
>>>>> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
>>>>> multiple epoll fds (or epoll sets) attached to the same wakeup 
>>>>> source,
>>>>> and have the wakeups 'rotate' among the epoll sets. The wakeup
>>>>> essentially walks the list of waiters, wakes up the first thread
>>>>> that is actively in epoll_wait(), stops and moves the woken up
>>>>> epoll set to the end of the list. So it attempts to balance
>>>>> the wakeups among the epoll sets, I think in the way that you
>>>>> were describing.
>>>>> 
>>>>> Here is the patchset:
>>>>> 
>>>>> https://lkml.org/lkml/2015/2/24/667
>>>>> 
>>>>> The test program shows how to use the API. Essentially, you
>>>>> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
>>>>> which you then attach to you're shared wakeup source and
>>>>> then to your epoll sets. Please let me know if its unclear.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> -Jason
>>>> 
>>>> In my particular case I need to work with multiple
>>>> processes/executables running (not threads) and listening on same
>>>> queues (this concept allows to sysadmin easily manage those 
>>>> processes
>>>> (start new ones for balancing or stop them with out service
>>>> interruption), and if any process dies for some reason (signal, 
>>>> core,
>>>> etc..), the whole application does not get killed, but only one
>>>> transaction is lost).
>>>> 
>>>> Recently I did tests, and found out that kernel's epoll currently
>>>> sends notifications to 4 processes (I think it is EP_MAX_NESTS
>>>> constant) waiting on same resource (those other 6 from my example
>>>> will stay in sleep state). So it is not as bad as I thought before.
>>>> It could be nice if EP_MAX_NESTS could be configurable, but I guess 
>>>> 4
>>>> is fine too.
>>>> 
>>> 
>>> hmmm...EP_MAX_NESTS is about the level 'nesting' epoll sets, IE
>>> if you can do ep1->ep2->ep3->ep4-> <wakeup src fd>. But you
>>> can't add in 'ep5'. Where the 'epN' above represent epoll file
>>> descriptors that are attached together via: EPOLL_CTL_ADD.
>>> 
>>> The nesting does not affect how wakeups are down. All epoll fds
>>> that are attached to the even source fd are going to get wakeups.
>>> 
>>> 
>>>> Jason, does your patch work for multi-process application? How hard
>>>> it would be to implement this for such scenario?
>>> 
>>> I don't think it would be too hard, but it requires:
>>> 
>>> 1) adding the patches
>>> 2) re-compiling, running new kernel
>>> 3) modifying your app to the new API.
>>> 
>>> Thanks,
>>> 
>>> -Jason
>>> 
>>> 
>>>> 
>>>> Madars
>>>> 
>>>>> 
>>>>>> Typical applications have few (probably only one) listen sockets 
>>>>>> or
>>>>>> POSIX mqueues; so I would rather use dedicated threads to issue
>>>>>> blocking syscalls (accept4 or mq_timedreceive).
>>>>>> 
>>>>>> Making blocking syscalls allows exclusive wakeups to avoid 
>>>>>> thundering
>>>>>> herds.
>>>>>> 
>>>>>>> How do you think, would it be real to implement this? How about
>>>>>>> concurrency?
>>>>>>> Can you please give me some hints from which points in code to 
>>>>>>> start
>>>>>>> to implement these changes?
>>>>>> 
>>>>>> For now, I suggest dedicating a thread in each process to do
>>>>>> mq_timedreceive/mq_receive, assuming you only have a small amount
>>>>>> of queues in your system.
>>>>>> 
>>>>>> 
>>>>>> [*] mq_timedreceive may copy a largish buffer which benefits from
>>>>>>     staying on the same CPU as much as possible.
>>>>>>     Contrary, accept4 only creates a client socket.  With a C10K+
>>>>>>     socket server (e.g. http/memcached/DB), a typical new client
>>>>>>     socket spends a fair amount of time idle.  Thus I don't 
>>>>>> believe
>>>>>>     memory locality inside the kernel is much concern when there's
>>>>>>     thousands of accepted client sockets.
>>>>>> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-11-30 21:28   ` Madars Vitolins
@ 2015-12-01 20:11     ` Jason Baron
  2015-12-05 11:47       ` Madars Vitolins
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Baron @ 2015-12-01 20:11 UTC (permalink / raw)
  To: Madars Vitolins; +Cc: Eric Wong, linux-kernel

Hi Madars,

On 11/30/2015 04:28 PM, Madars Vitolins wrote:
> Hi Jason,
> 
> I today did search the mail archive and checked your offered patch did on February, it basically does the some (flag for add_wait_queue_exclusive() + balance).
> 
> So I plan to run off some tests with your patch, flag on/off and will provide results. I guess if I pull up 250 or 500 processes (which could real for production environment) waiting on one Q, then there could be a notable difference in performance with EPOLLEXCLUSIVE set or not.
> 

Sounds good. Below is an updated patch if you want to try it - it only adds the 'EPOLLEXCLUSIVE' flag.


diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1e009ca..265fa7b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -92,7 +92,7 @@
  */
 
 /* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)
 
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
@@ -1002,6 +1002,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	unsigned long flags;
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
+	int ewake = 0;
 
 	if ((unsigned long)key & POLLFREE) {
 		ep_pwq_from_wait(wait)->whead = NULL;
@@ -1066,8 +1067,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
 	 */
-	if (waitqueue_active(&ep->wq))
+	if (waitqueue_active(&ep->wq)) {
+		ewake = 1;
 		wake_up_locked(&ep->wq);
+	}
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
@@ -1078,6 +1081,9 @@ out_unlock:
 	if (pwake)
 		ep_poll_safewake(&ep->poll_wait);
 
+	if (epi->event.events & EPOLLEXCLUSIVE)
+		return ewake;
+
 	return 1;
 }
 
@@ -1095,7 +1101,10 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (epi->event.events & EPOLLEXCLUSIVE)
+			add_wait_queue_exclusive(whead, &pwq->wait);
+		else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
@@ -1861,6 +1870,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (f.file == tf.file || !is_file_epoll(f.file))
 		goto error_tgt_fput;
 
+	if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
+		(op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
+		goto error_tgt_fput;
+
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
 	 * our own data structure.
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..925bbfb 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -26,6 +26,9 @@
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Add exclusively */
+#define EPOLLEXCLUSIVE (1 << 28)
+
 /*
  * Request the handling of system wakeup events so as to prevent system suspends
  * from happening while those events are being processed.
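
Note for testing: with the epoll_ctl() hunk above, the flag can only be
supplied at EPOLL_CTL_ADD time on a non-epoll target fd; EPOLL_CTL_MOD
with the flag, or adding another epoll fd with it, is rejected. A
minimal sketch (epfd and qfd are illustrative):

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1 << 28)        /* matches the uapi hunk above */
#endif

        struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };

        ev.data.fd = qfd;               /* e.g. a POSIX mqueue descriptor */
        /* must be EPOLL_CTL_ADD; the patch rejects EPOLL_CTL_MOD here */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, qfd, &ev) < 0)
                perror("epoll_ctl");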


> During kernel hacking with debug print, with 10 processes waiting on one event source, with original kernel I did see lot un-needed processing inside of eventpoll.c, it got 10x calls to ep_poll_callback() and other stuff for single event, which results with few processes waken up in user space (count probably gets randomly depending on concurrency).
> 
> 
> Meanwhile we are not the only ones who talk about this patch, see here: http://stackoverflow.com/questions/33226842/epollexclusive-and-epollroundrobin-flags-in-mainstream-kernel others are asking too.
> 
> So what is the current situation with your patch, what is the blocking for getting it into mainline?
> 

If we can show some good test results here I will re-submit it.

Thanks,

-Jason


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-12-01 20:11     ` Jason Baron
@ 2015-12-05 11:47       ` Madars Vitolins
  0 siblings, 0 replies; 11+ messages in thread
From: Madars Vitolins @ 2015-12-05 11:47 UTC (permalink / raw)
  To: Jason Baron; +Cc: Eric Wong, linux-kernel

Hi Jason,

I did the testing and wrote a blog article about it:
https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/

In summary:

Test case:
- One multi-threaded binary with 10 threads makes a total of 1'000'000
calls to 250 single-threaded processes doing epoll() on a POSIX queue
- A 'call' basically sends a message to the shared queue (serviced by
those 250 load-balanced processes), which send the reply back to the
client thread's private queue

Tests were done on the following system:
- Host system: Linux Mint Mate 17.2 64bit, kernel: 3.13.0-24-generic
- CPU: Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz (two cores)
- RAM: 16 GB
- Virtualization platform: Oracle VirtualBox 4.3.28
- Guest OS: Gentoo Linux 2015.03, kernel 4.3.0-gentoo, 64 bit.
- CPU for guest: Two cores
- RAM for guest: 5GB (no swap usage, free about 4GB)
- Enduro/X version: 2.3.2


Results with the original kernel (no EPOLLEXCLUSIVE):

$ time ./bankcl
...

real 14m20.561s
user 0m21.823s
sys 10m49.821s


Patched kernel with the EPOLLEXCLUSIVE flag in use:

$ time ./bankcl
...

real 0m24.953s
user 0m17.497s
sys 0m4.445s

Thus 14 minutes vs. 25 seconds (860.6 s / 25.0 s, roughly 34.5x): the
EPOLLEXCLUSIVE flag makes the application run about *35 times faster*!

Guys, this is a MUST HAVE patch!

Thanks,
Madars



Jason Baron @ 2015-12-01 22:11 rakstīja:
> Hi Madars,
> 
> On 11/30/2015 04:28 PM, Madars Vitolins wrote:
>> Hi Jason,
>> 
>> I today did search the mail archive and checked your offered patch did 
>> on February, it basically does the some (flag for 
>> add_wait_queue_exclusive() + balance).
>> 
>> So I plan to run off some tests with your patch, flag on/off and will 
>> provide results. I guess if I pull up 250 or 500 processes (which 
>> could real for production environment) waiting on one Q, then there 
>> could be a notable difference in performance with EPOLLEXCLUSIVE set 
>> or not.
>> 
> 
> Sounds good. Below is an updated patch if you want to try it - it only
> adds the 'EPOLLEXCLUSIVE' flag.
> 
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 1e009ca..265fa7b 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -92,7 +92,7 @@
>   */
> 
>  /* Epoll private bits inside the event mask */
> -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
> +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | 
> EPOLLEXCLUSIVE)
> 
>  /* Maximum number of nesting allowed inside epoll sets */
>  #define EP_MAX_NESTS 4
> @@ -1002,6 +1002,7 @@ static int ep_poll_callback(wait_queue_t *wait,
> unsigned mode, int sync, void *k
>  	unsigned long flags;
>  	struct epitem *epi = ep_item_from_wait(wait);
>  	struct eventpoll *ep = epi->ep;
> +	int ewake = 0;
> 
>  	if ((unsigned long)key & POLLFREE) {
>  		ep_pwq_from_wait(wait)->whead = NULL;
> @@ -1066,8 +1067,10 @@ static int ep_poll_callback(wait_queue_t *wait,
> unsigned mode, int sync, void *k
>  	 * Wake up ( if active ) both the eventpoll wait list and the 
> ->poll()
>  	 * wait list.
>  	 */
> -	if (waitqueue_active(&ep->wq))
> +	if (waitqueue_active(&ep->wq)) {
> +		ewake = 1;
>  		wake_up_locked(&ep->wq);
> +	}
>  	if (waitqueue_active(&ep->poll_wait))
>  		pwake++;
> 
> @@ -1078,6 +1081,9 @@ out_unlock:
>  	if (pwake)
>  		ep_poll_safewake(&ep->poll_wait);
> 
> +	if (epi->event.events & EPOLLEXCLUSIVE)
> +		return ewake;
> +
>  	return 1;
>  }
> 
> @@ -1095,7 +1101,10 @@ static void ep_ptable_queue_proc(struct file
> *file, wait_queue_head_t *whead,
>  		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>  		pwq->whead = whead;
>  		pwq->base = epi;
> -		add_wait_queue(whead, &pwq->wait);
> +		if (epi->event.events & EPOLLEXCLUSIVE)
> +			add_wait_queue_exclusive(whead, &pwq->wait);
> +		else
> +			add_wait_queue(whead, &pwq->wait);
>  		list_add_tail(&pwq->llink, &epi->pwqlist);
>  		epi->nwait++;
>  	} else {
> @@ -1861,6 +1870,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, 
> int, fd,
>  	if (f.file == tf.file || !is_file_epoll(f.file))
>  		goto error_tgt_fput;
> 
> +	if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
> +		(op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
> +		goto error_tgt_fput;
> +
>  	/*
>  	 * At this point it is safe to assume that the "private_data" 
> contains
>  	 * our own data structure.
> diff --git a/include/uapi/linux/eventpoll.h 
> b/include/uapi/linux/eventpoll.h
> index bc81fb2..925bbfb 100644
> --- a/include/uapi/linux/eventpoll.h
> +++ b/include/uapi/linux/eventpoll.h
> @@ -26,6 +26,9 @@
>  #define EPOLL_CTL_DEL 2
>  #define EPOLL_CTL_MOD 3
> 
> +/* Add exclusively */
> +#define EPOLLEXCLUSIVE (1 << 28)
> +
>  /*
>   * Request the handling of system wakeup events so as to prevent
> system suspends
>   * from happening while those events are being processed.
> 
> 
>> During kernel hacking with debug print, with 10 processes waiting on 
>> one event source, with original kernel I did see lot un-needed 
>> processing inside of eventpoll.c, it got 10x calls to 
>> ep_poll_callback() and other stuff for single event, which results 
>> with few processes waken up in user space (count probably gets 
>> randomly depending on concurrency).
>> 
>> 
>> Meanwhile we are not the only ones who talk about this patch, see 
>> here: 
>> http://stackoverflow.com/questions/33226842/epollexclusive-and-epollroundrobin-flags-in-mainstream-kernel 
>> others are asking too.
>> 
>> So what is the current situation with your patch, what is the blocking 
>> for getting it into mainline?
>> 
> 
> If we can show some good test results here I will re-submit it.
> 
> Thanks,
> 
> -Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-08-05 11:06     ` Madars Vitolins
@ 2015-08-05 13:32       ` Jason Baron
  0 siblings, 0 replies; 11+ messages in thread
From: Jason Baron @ 2015-08-05 13:32 UTC (permalink / raw)
  To: Madars Vitolins; +Cc: Eric Wong, linux-kernel

On 08/05/2015 07:06 AM, Madars Vitolins wrote:
> Jason Baron @ 2015-08-04 18:02 rakstīja:
>> On 08/03/2015 07:48 PM, Eric Wong wrote:
>>> Madars Vitolins <m@silodev.com> wrote:
>>>> Hi Folks,
>>>>
>>>> I am developing kind of open systems application, which uses
>>>> multiple processes/executables where each of them monitors some set
>>>> of resources (in this case POSIX Queues) via epoll interface. For
>>>> example when 10 processes on same queue are in state of epoll_wait()
>>>> and one message arrives, all 10 processes gets woken up and all of
>>>> them tries to read the message from Q. One succeeds, the others gets
>>>> EAGAIN error. The problem is with those others, which generates
>>>> extra context switches - useless CPU usage. With more processes
>>>> inefficiency gets higher.
>>>>
>>>> I tried to use EPOLLONESHOT, but no help. Seems this is suitable for
>>>> multi-threaded application and not for multi-process application.
>>>
>>> Correct.  Most FDs are not shared across processes.
>>>
>>>> Ideal mechanism for this would be:
>>>> 1. If multiple epoll sets in kernel matches same event and one or
>>>> more processes are in state of epoll_wait() - then send event only
>>>> to one waiter.
>>>> 2. If none of processes are in wait state, then send the event to
>>>> all epoll sets (as it is currently). Then the first free process
>>>> will grab the event.
>>>
>>> Jason Baron was working on this (search LKML archives for
>>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
>>>
>>> However, I was unconvinced about modifying epoll.
>>>
>>> Perhaps I may be more easily convinced about your mqueue case than his
>>> case for listen sockets, though[*]
>>>
>>
>> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
>> multiple epoll fds (or epoll sets) attached to the same wakeup source,
>> and have the wakeups 'rotate' among the epoll sets. The wakeup
>> essentially walks the list of waiters, wakes up the first thread
>> that is actively in epoll_wait(), stops and moves the woken up
>> epoll set to the end of the list. So it attempts to balance
>> the wakeups among the epoll sets, I think in the way that you
>> were describing.
>>
>> Here is the patchset:
>>
>> https://lkml.org/lkml/2015/2/24/667
>>
>> The test program shows how to use the API. Essentially, you
>> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
>> which you then attach to you're shared wakeup source and
>> then to your epoll sets. Please let me know if its unclear.
>>
>> Thanks,
>>
>> -Jason
> 
> In my particular case I need to work with multiple processes/executables running (not threads) and listening on same queues (this concept allows to sysadmin easily manage those processes (start new ones for balancing or stop them with out service interruption), and if any process dies for some reason (signal, core, etc..), the whole application does not get killed, but only one transaction is lost).
> 
> Recently I did tests, and found out that kernel's epoll currently sends notifications to 4 processes (I think it is EP_MAX_NESTS constant) waiting on same resource (those other 6 from my example will stay in sleep state). So it is not as bad as I thought before. It could be nice if EP_MAX_NESTS could be configurable, but I guess 4 is fine too.
> 

hmmm...EP_MAX_NESTS is about the level of 'nesting' of epoll sets, i.e.
you can do ep1->ep2->ep3->ep4-> <wakeup src fd>, but you can't add in
'ep5'. The 'epN' above represent epoll file descriptors that are
attached together via EPOLL_CTL_ADD.

The nesting does not affect how wakeups are done. All epoll fds
that are attached to the event source fd are going to get wakeups.
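
A rough illustration of what that nesting means (ep1 monitors ep2,
which monitors ep3, and so on; EP_MAX_NESTS limits the depth of this
chain, not the number of waiters on one fd):

        int ep1 = epoll_create1(0);
        int ep2 = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN };

        /* ep1 -> ep2: ep1 now reports events that become ready on ep2 */
        ev.data.fd = ep2;
        epoll_ctl(ep1, EPOLL_CTL_ADD, ep2, &ev);

        /* repeating this for ep2 -> ep3, ep3 -> ep4, etc. builds the
         * chain; the kernel refuses chains that are nested too deeply */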


> Jason, does your patch work for multi-process application? How hard it would be to implement this for such scenario?

I don't think it would be too hard, but it requires:

1) adding the patches
2) re-compiling, running new kernel
3) modifying your app to the new API.

Thanks,

-Jason


> 
> Madars
> 
>>
>>> Typical applications have few (probably only one) listen sockets or
>>> POSIX mqueues; so I would rather use dedicated threads to issue
>>> blocking syscalls (accept4 or mq_timedreceive).
>>>
>>> Making blocking syscalls allows exclusive wakeups to avoid thundering
>>> herds.
>>>
>>>> How do you think, would it be real to implement this? How about
>>>> concurrency?
>>>> Can you please give me some hints from which points in code to start
>>>> to implement these changes?
>>>
>>> For now, I suggest dedicating a thread in each process to do
>>> mq_timedreceive/mq_receive, assuming you only have a small amount
>>> of queues in your system.
>>>
>>>
>>> [*] mq_timedreceive may copy a largish buffer which benefits from
>>>     staying on the same CPU as much as possible.
>>>     Contrary, accept4 only creates a client socket.  With a C10K+
>>>     socket server (e.g. http/memcached/DB), a typical new client
>>>     socket spends a fair amount of time idle.  Thus I don't believe
>>>     memory locality inside the kernel is much concern when there's
>>>     thousands of accepted client sockets.
>>>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-08-04 15:02   ` Jason Baron
@ 2015-08-05 11:06     ` Madars Vitolins
  2015-08-05 13:32       ` Jason Baron
  0 siblings, 1 reply; 11+ messages in thread
From: Madars Vitolins @ 2015-08-05 11:06 UTC (permalink / raw)
  To: Jason Baron; +Cc: Eric Wong, linux-kernel

Jason Baron @ 2015-08-04 18:02 rakstīja:
> On 08/03/2015 07:48 PM, Eric Wong wrote:
>> Madars Vitolins <m@silodev.com> wrote:
>>> Hi Folks,
>>>
>>> I am developing kind of open systems application, which uses
>>> multiple processes/executables where each of them monitors some set
>>> of resources (in this case POSIX Queues) via epoll interface. For
>>> example when 10 processes on same queue are in state of 
>>> epoll_wait()
>>> and one message arrives, all 10 processes gets woken up and all of
>>> them tries to read the message from Q. One succeeds, the others 
>>> gets
>>> EAGAIN error. The problem is with those others, which generates
>>> extra context switches - useless CPU usage. With more processes
>>> inefficiency gets higher.
>>>
>>> I tried to use EPOLLONESHOT, but no help. Seems this is suitable 
>>> for
>>> multi-threaded application and not for multi-process application.
>>
>> Correct.  Most FDs are not shared across processes.
>>
>>> Ideal mechanism for this would be:
>>> 1. If multiple epoll sets in kernel matches same event and one or
>>> more processes are in state of epoll_wait() - then send event only
>>> to one waiter.
>>> 2. If none of processes are in wait state, then send the event to
>>> all epoll sets (as it is currently). Then the first free process
>>> will grab the event.
>>
>> Jason Baron was working on this (search LKML archives for
>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
>>
>> However, I was unconvinced about modifying epoll.
>>
>> Perhaps I may be more easily convinced about your mqueue case than 
>> his
>> case for listen sockets, though[*]
>>
>
> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
> multiple epoll fds (or epoll sets) attached to the same wakeup 
> source,
> and have the wakeups 'rotate' among the epoll sets. The wakeup
> essentially walks the list of waiters, wakes up the first thread
> that is actively in epoll_wait(), stops and moves the woken up
> epoll set to the end of the list. So it attempts to balance
> the wakeups among the epoll sets, I think in the way that you
> were describing.
>
> Here is the patchset:
>
> https://lkml.org/lkml/2015/2/24/667
>
> The test program shows how to use the API. Essentially, you
> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
> which you then attach to you're shared wakeup source and
> then to your epoll sets. Please let me know if its unclear.
>
> Thanks,
>
> -Jason

In my particular case I need to work with multiple
processes/executables running (not threads) and listening on the same
queues (this concept lets a sysadmin easily manage those processes:
start new ones for load balancing or stop them without service
interruption; and if any process dies for some reason (signal, core
dump, etc.), the whole application does not get killed, only one
transaction is lost).

Recently I ran tests and found that the kernel's epoll currently sends
notifications to 4 processes (I think it is the EP_MAX_NESTS constant)
waiting on the same resource (the other 6 from my example stay asleep).
So it is not as bad as I thought before. It would be nice if
EP_MAX_NESTS could be configurable, but I guess 4 is fine too.

Jason, does your patch work for a multi-process application? How hard
would it be to implement this for such a scenario?

Madars

>
>> Typical applications have few (probably only one) listen sockets or
>> POSIX mqueues; so I would rather use dedicated threads to issue
>> blocking syscalls (accept4 or mq_timedreceive).
>>
>> Making blocking syscalls allows exclusive wakeups to avoid 
>> thundering
>> herds.
>>
>>> How do you think, would it be real to implement this? How about
>>> concurrency?
>>> Can you please give me some hints from which points in code to 
>>> start
>>> to implement these changes?
>>
>> For now, I suggest dedicating a thread in each process to do
>> mq_timedreceive/mq_receive, assuming you only have a small amount
>> of queues in your system.
>>
>>
>> [*] mq_timedreceive may copy a largish buffer which benefits from
>>     staying on the same CPU as much as possible.
>>     Contrary, accept4 only creates a client socket.  With a C10K+
>>     socket server (e.g. http/memcached/DB), a typical new client
>>     socket spends a fair amount of time idle.  Thus I don't believe
>>     memory locality inside the kernel is much concern when there's
>>     thousands of accepted client sockets.
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-08-03 23:48 ` Eric Wong
@ 2015-08-04 15:02   ` Jason Baron
  2015-08-05 11:06     ` Madars Vitolins
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Baron @ 2015-08-04 15:02 UTC (permalink / raw)
  To: Eric Wong, Madars Vitolins; +Cc: linux-kernel



On 08/03/2015 07:48 PM, Eric Wong wrote:
> Madars Vitolins <m@silodev.com> wrote:
>> Hi Folks,
>>
>> I am developing kind of open systems application, which uses
>> multiple processes/executables where each of them monitors some set
>> of resources (in this case POSIX Queues) via epoll interface. For
>> example when 10 processes on same queue are in state of epoll_wait()
>> and one message arrives, all 10 processes gets woken up and all of
>> them tries to read the message from Q. One succeeds, the others gets
>> EAGAIN error. The problem is with those others, which generates
>> extra context switches - useless CPU usage. With more processes
>> inefficiency gets higher.
>>
>> I tried to use EPOLLONESHOT, but no help. Seems this is suitable for
>> multi-threaded application and not for multi-process application.
> 
> Correct.  Most FDs are not shared across processes.
> 
>> Ideal mechanism for this would be:
>> 1. If multiple epoll sets in kernel matches same event and one or
>> more processes are in state of epoll_wait() - then send event only
>> to one waiter.
>> 2. If none of processes are in wait state, then send the event to
>> all epoll sets (as it is currently). Then the first free process
>> will grab the event.
> 
> Jason Baron was working on this (search LKML archives for
> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
> 
> However, I was unconvinced about modifying epoll.
> 
> Perhaps I may be more easily convinced about your mqueue case than his
> case for listen sockets, though[*]
> 

Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
multiple epoll fds (or epoll sets) attached to the same wakeup source,
and have the wakeups 'rotate' among the epoll sets. The wakeup
essentially walks the list of waiters, wakes up the first thread
that is actively in epoll_wait(), stops and moves the woken up
epoll set to the end of the list. So it attempts to balance
the wakeups among the epoll sets, I think in the way that you
were describing.

Here is the patchset:

https://lkml.org/lkml/2015/2/24/667

The test program shows how to use the API. Essentially, you
have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
which you then attach to your shared wakeup source and
then to your epoll sets. Please let me know if it's unclear.

Thanks,

-Jason

> Typical applications have few (probably only one) listen sockets or
> POSIX mqueues; so I would rather use dedicated threads to issue
> blocking syscalls (accept4 or mq_timedreceive).
> 
> Making blocking syscalls allows exclusive wakeups to avoid thundering
> herds.
> 
>> How do you think, would it be real to implement this? How about
>> concurrency?
>> Can you please give me some hints from which points in code to start
>> to implement these changes?
> 
> For now, I suggest dedicating a thread in each process to do
> mq_timedreceive/mq_receive, assuming you only have a small amount
> of queues in your system.
> 
> 
> [*] mq_timedreceive may copy a largish buffer which benefits from
>     staying on the same CPU as much as possible.
>     Contrary, accept4 only creates a client socket.  With a C10K+
>     socket server (e.g. http/memcached/DB), a typical new client
>     socket spends a fair amount of time idle.  Thus I don't believe
>     memory locality inside the kernel is much concern when there's
>     thousands of accepted client sockets.
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-07-13 12:34 Madars Vitolins
  2015-07-15 13:07 ` Madars Vitolins
@ 2015-08-03 23:48 ` Eric Wong
  2015-08-04 15:02   ` Jason Baron
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Wong @ 2015-08-03 23:48 UTC (permalink / raw)
  To: Madars Vitolins; +Cc: linux-kernel, Jason Baron

Madars Vitolins <m@silodev.com> wrote:
> Hi Folks,
> 
> I am developing kind of open systems application, which uses
> multiple processes/executables where each of them monitors some set
> of resources (in this case POSIX Queues) via epoll interface. For
> example when 10 processes on same queue are in state of epoll_wait()
> and one message arrives, all 10 processes gets woken up and all of
> them tries to read the message from Q. One succeeds, the others gets
> EAGAIN error. The problem is with those others, which generates
> extra context switches - useless CPU usage. With more processes
> inefficiency gets higher.
> 
> I tried to use EPOLLONESHOT, but no help. Seems this is suitable for
> multi-threaded application and not for multi-process application.

Correct.  Most FDs are not shared across processes.

> Ideal mechanism for this would be:
> 1. If multiple epoll sets in kernel matches same event and one or
> more processes are in state of epoll_wait() - then send event only
> to one waiter.
> 2. If none of processes are in wait state, then send the event to
> all epoll sets (as it is currently). Then the first free process
> will grab the event.

Jason Baron was working on this (search LKML archives for
EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)

However, I was unconvinced about modifying epoll.

Perhaps I may be more easily convinced about your mqueue case than his
case for listen sockets, though[*]

Typical applications have few (probably only one) listen sockets or
POSIX mqueues; so I would rather use dedicated threads to issue
blocking syscalls (accept4 or mq_timedreceive).

Making blocking syscalls allows exclusive wakeups to avoid thundering
herds.

> How do you think, would it be real to implement this? How about
> concurrency?
> Can you please give me some hints from which points in code to start
> to implement these changes?

For now, I suggest dedicating a thread in each process to do
mq_timedreceive/mq_receive, assuming you only have a small amount
of queues in your system.
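
A minimal sketch of that dedicated-thread approach (handle_message() is
a hypothetical application callback; the queue is opened blocking,
i.e. without O_NONBLOCK, so that, as noted above, the wakeups are
exclusive and there is no herd):

#include <mqueue.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>

void handle_message(const char *msg, ssize_t len);  /* app-specific */

static void *receiver(void *arg)
{
        mqd_t q = *(mqd_t *)arg;
        char buf[8192];                 /* >= the queue's mq_msgsize */

        for (;;) {
                ssize_t n = mq_receive(q, buf, sizeof(buf), NULL);

                if (n >= 0)
                        handle_message(buf, n);
                else
                        perror("mq_receive");
        }
        return NULL;
}

/* in main(): pthread_create(&tid, NULL, receiver, &q); */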


[*] mq_timedreceive may copy a largish buffer which benefits from
    staying on the same CPU as much as possible.
    Contrary, accept4 only creates a client socket.  With a C10K+
    socket server (e.g. http/memcached/DB), a typical new client
    socket spends a fair amount of time idle.  Thus I don't believe
    memory locality inside the kernel is much concern when there's
    thousands of accepted client sockets.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: epoll and multiple processes - eliminate unneeded process wake-ups
  2015-07-13 12:34 Madars Vitolins
@ 2015-07-15 13:07 ` Madars Vitolins
  2015-08-03 23:48 ` Eric Wong
  1 sibling, 0 replies; 11+ messages in thread
From: Madars Vitolins @ 2015-07-15 13:07 UTC (permalink / raw)
  To: linux-kernel

Any comments?

Madars

Madars Vitolins @ 2015-07-13 15:34 rakstīja:
> Hi Folks,
>
> I am developing kind of open systems application, which uses multiple
> processes/executables where each of them monitors some set of
> resources (in this case POSIX Queues) via epoll interface. For 
> example
> when 10 processes on same queue are in state of epoll_wait() and one
> message arrives, all 10 processes gets woken up and all of them tries
> to read the message from Q. One succeeds, the others gets EAGAIN
> error. The problem is with those others, which generates extra 
> context
> switches - useless CPU usage. With more processes inefficiency gets
> higher.
>
> I tried to use EPOLLONESHOT, but no help. Seems this is suitable for
> multi-threaded application and not for multi-process application.
>
> Ideal mechanism for this would be:
> 1. If multiple epoll sets in kernel matches same event and one or
> more processes are in state of epoll_wait() - then send event only to
> one waiter.
> 2. If none of processes are in wait state, then send the event to all
> epoll sets (as it is currently). Then the first free process will 
> grab
> the event.
>
> How do you think, would it be real to implement this? How about 
> concurrency?
> Can you please give me some hints from which points in code to start
> to implement these changes?
>
>
> Thanks a lot in advance,
> Madars


^ permalink raw reply	[flat|nested] 11+ messages in thread

* epoll and multiple processes - eliminate unneeded process wake-ups
@ 2015-07-13 12:34 Madars Vitolins
  2015-07-15 13:07 ` Madars Vitolins
  2015-08-03 23:48 ` Eric Wong
  0 siblings, 2 replies; 11+ messages in thread
From: Madars Vitolins @ 2015-07-13 12:34 UTC (permalink / raw)
  To: linux-kernel

Hi Folks,

I am developing a kind of open-systems application, which uses multiple
processes/executables where each of them monitors some set of resources
(in this case POSIX queues) via the epoll interface. For example, when
10 processes on the same queue are blocked in epoll_wait() and one
message arrives, all 10 processes get woken up and all of them try to
read the message from the queue. One succeeds, the others get an EAGAIN
error. The problem is with those others, which generate extra context
switches and useless CPU usage. With more processes the inefficiency
gets higher.
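
For illustration, each of those 10 processes runs roughly the following
pattern today (names and buffer size are only examples):

#include <sys/epoll.h>
#include <mqueue.h>
#include <fcntl.h>
#include <errno.h>

/* The queue is opened O_NONBLOCK, so the read after the wakeup can
 * fail with EAGAIN when another process already took the message. */
static void worker(const char *qname)
{
        mqd_t q = mq_open(qname, O_RDONLY | O_NONBLOCK);
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN };
        char buf[8192];

        ev.data.fd = q;
        epoll_ctl(epfd, EPOLL_CTL_ADD, q, &ev);

        for (;;) {
                struct epoll_event out;

                if (epoll_wait(epfd, &out, 1, -1) <= 0)
                        continue;
                if (mq_receive(q, buf, sizeof(buf), NULL) < 0 &&
                    errno == EAGAIN)
                        continue;       /* woken up for nothing */
                /* ... process the message ... */
        }
}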

I tried to use EPOLLONESHOT, but it did not help. It seems this is
suitable for a multi-threaded application and not for a multi-process
one.

The ideal mechanism for this would be:
1. If multiple epoll sets in the kernel match the same event and one or
more processes are blocked in epoll_wait(), then send the event to only
one waiter.
2. If none of the processes are waiting, then send the event to all
epoll sets (as is done currently). Then the first free process will
grab the event.

What do you think, would it be feasible to implement this? What about
concurrency?
Can you please give me some hints on which points in the code to start
from to implement these changes?


Thanks a lot in advance,
Madars

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-12-05 11:49 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
2015-11-28 22:54 epoll and multiple processes - eliminate unneeded process wake-ups Madars Vitolins
2015-11-30 19:45 ` Jason Baron
2015-11-30 21:28   ` Madars Vitolins
2015-12-01 20:11     ` Jason Baron
2015-12-05 11:47       ` Madars Vitolins
  -- strict thread matches above, loose matches on Subject: below --
2015-07-13 12:34 Madars Vitolins
2015-07-15 13:07 ` Madars Vitolins
2015-08-03 23:48 ` Eric Wong
2015-08-04 15:02   ` Jason Baron
2015-08-05 11:06     ` Madars Vitolins
2015-08-05 13:32       ` Jason Baron
